BARM and BalticMicrobeDB, a reference metagenome and interface to meta-omic data for the Baltic Sea

The Baltic Sea is one of the world’s largest brackish water bodies and is characterised by pronounced physicochemical gradients where microbes are the main biogeochemical catalysts. Meta-omic methods provide rich information on the composition of, and activities within, microbial ecosystems, but are computationally heavy to perform. We here present the Baltic Sea Reference Metagenome (BARM), complete with annotated genes to facilitate further studies with much less computational effort. The assembly is constructed using 2.6 billion metagenomic reads from 81 water samples, spanning both spatial and temporal dimensions, and contains 6.8 million genes that have been annotated for function and taxonomy. The assembly is useful as a reference, facilitating taxonomic and functional annotation of additional samples by simply mapping their reads against the assembly. This capability is demonstrated by the successful mapping and annotation of 24 external samples. In addition, we present a public web interface, BalticMicrobeDB, for interactive exploratory analysis of the dataset.

The samples were sequenced at the National Genomics Infrastructure at Science for Life Laboratory, Stockholm, Sweden, using a full HiSeq 2500 high-output flowcell producing an average of 69.5 million pair-end reads per sample.
The redoxcline samples consist of samples from station Boknis Eck (Data Citation 4), located at the entrance of the Eckernforde Bay in the southwestern Baltic Sea, and from station TF0271 at the Gotland Deep in the eastern Gotland Basin. The Boknis Eck station was sampled on September 23, 2014 on the R/ V Littorina during routine monitoring activities performed monthly by the GEOMAR Helmholtz Centre for Ocean Research Kiel. Due to windy conditions before the sampling day, the water at the Boknis Eck station was mixed over most of the water column and only the bottom water was sulfidic. Water was sampled from the mixed oxygenated layer and from the sulfidic bottom water, which was captured on a 3 μm pore size membrane filters (Whatman, Maidstone, UK) followed by 0.2 μm pore size Sterivex-GV filters (Millipore Billerica, Massachusetts, USA). The Gotland Basin was sampled during the cruise EMB087 on the R/V Elisabeth Mann Borgese on October 18 and October 26, 2014. The samples from October 18 were taken in the context of an experiment close to the oxic-anoxic interface from suboxic and anoxic water layers and were captured directly on 0.2 μm pore size Durapore membrane filters (Whatman, Maidstone, UK). The samples from October 26 were taken to cover different zones in the redox gradient (suboxic, oxic-anoxic interface, upper sulfidic, lower sulfidic) and were captured first on a 3 μm pore size membrane filters (Whatman, Maidstone, UK) followed by 0.2 μm pore size Sterivex-GV filters. DNA from water captured on 3 μm pore size membrane filters and 0.2 μm Sterivex-GV filters was extracted using the QIAmp DNA Mini Kit (Qiagen, Hilden, Germany): ATL buffer was added to filter pieces together with 200 μm low-binding Zirconium beads (OPS Diagnostics, Lebanon, NY, USA) and the suspension was vortexed for 5 minutes at maximum speed. Subsequently proteinase K was added and the suspension was incubated for approximately 1h at 56°C before continuing DNA extraction by following the manufacturer's instructions. Nucleic acids from Gotland Basin water sampled on October 18 on 0.2 μm pore size membrane filters were extracted using the AllPrep DNA/RNA Mini Kit (Qiagen, Hilden, Germany). Similar as before, filters were vortexed together with Zirconium beads in RTL buffer before continuing nucleic acid extraction by following the manufacturer's protocol. The concentration and quality of the eluted DNA was assured by gel electrophoresis. The samples were sequenced on a single HiSeq 2500 lane producing an average of 20.7 million pair-end reads per sample.
All sequencing libraries (including LMO) were prepared with the Rubicon ThruPlex kit (Rubicon Genomics, Ann Arbor, Michigan, USA) according to the instructions of the manufacturer.

Preprocessing and Assembly
The quality of the reads were checked and visualized with FastQC (http://www.bioinformatics.babraham. ac.uk/projects/fastqc/) through MultiQC 22 and trimmed from low quality bases with cutadapt 23 using Phred score 15 as a cutoff. Adapter sequences were also removed using cutadapt, keeping only read pairs where both reads in the pair were longer than 31 bases. Preprocessed reads were then assembled using Megahit 24 version 1.0.2 with default parameters including kmers 21,41,61,81 and 99.
Exclusively to the 30 samples from the transect cruise, genomic material (20 ng per L of seawater) from a known genome of Thermus thermophilus (strain HB8), which is not expected to be present in the Baltic Sea naturally, was added after filtration but prior to the DNA extraction, serving as internal standard to enable absolute quantifications. Aligning all contigs from the metagenome assembly against this reference genome showed that 84.1% of the genome was recovered within contigs aligning with average 99.82% identity. These additional genome contigs were kept in the reference assembly but reads aligning to the reference genome were filtered out before the quantification steps, and before uploading the processed reads to the European Nucleotide Archive (ENA) (Data Citation 5).

Functional Annotation
Genes were predicted on all contigs using Prodigal 25 version 2.6.3 with the '--meta' tag which potentially uses different coding tables for different contigs. Genes located on contigs longer or equal to 1 kilobase, identified with the script toolbox/scripts/fasta_lengths.py, were used for functional and taxonomic annotation. For functional annotation, the databases EggNOG 17 , Pfam 16 , TIGRFAM (http://www.jcvi. org/cgi-bin/tigrfams/index.cgi) and dbCAN 18 were chosen. Furthermore, EC-numbers 19 were extracted from the EggNOG annotations.
To annotate genes with EggNOG 17 IDs, the EggNOG hmm file for all organisms, NOG.hmm.tar.gz, version 4.5 was downloaded from http://eggnogdb.embl.de/download/eggnog_4.5/data/NOG/. For performance reasons, hmmsearch was used instead of hmmscan 26 , initially removing all hits with an E-value >0.0001. To select a maximum of one annotation per gene, the hit with highest score was chosen using the script toolbox/scripts/hmmer_filtering/keep_top_score.py. Information about each annotation was downloaded from http://eggnogdb.embl.de/download/eggnog_4.5/data/NOG/NOG.annotations.tsv.gz.
An Enzyme Commision (EC) number 19 was assigned to each EggNOG through the Uniprot 27 proteins included in the EggNOG model, if a majority of its EC-assigned members were assigned to that EC. Note that proteins could have multiple EC numbers assigned and therefore some EggNOGs were assigned multiple EC numbers. The files needed for the conversion were eggnog4.protein_id_conversion. tsv.gz (downloaded from http://eggnogdb.embl.de/download/eggnog_4.5/ on January 9th 2017) and NOG.members.tsv.gz (downloaded from http://eggnogdb.embl.de/download/eggnog_4.5/data/ NOG/ on January 9th 2017). The protein ID conversion file gives EC numbers per reference protein and the members file gives the reference proteins that build each model. The protein with taxaid 400682 and protein ID "PAC" was removed from the protein ID conversion file since it had 695 EC entries. Likewise for taxaid 7070 and protein ID "TCOGS2", with 686 EC entries. The protein ID with the third most entries had 6 entries and therefore the two others were deemed as outliers. The suspected reason is that these entries belong to different genes for these genomes but there were no way to resolve this and the EC-number assignment for each EggNOG was deemed to not be affected by this. Given the assignment of EC-numbers per EggNOG, the assignment per gene was done with toolbox/scripts/assign_ec_from_nog.py.
Annotation against the dbCAN 18 (DataBase for automated Carbohydrate-active enzyme ANnotation) database was performed using version 5 (downloaded from http://csbl.bmb.uga.edu/dbCAN/download. php). Following the instructions from dbCAN (downloaded from http://csbl.bmb.uga.edu/dbCAN/ download/readme.txt), hmmscan 26 was used with the option --domtblout and the result was further treated with the recommended script hmmscan-parser.sh (reference of used script available within toolbox/third_party_scripts/dbcan/hmmscan-parser.sh) from dbCAN requiring a covered fraction of the HMM larger than 0.3 and keeping long alignments (>80 amino acids) if the E-value was less than 1e-5 and short alignments if the E-value was less than 1e-3. An additional script toolbox/hmmer_filtering/ dbcan_strict_filtering.py was applied, choosing to follow recommendations for bacteria from dbCAN, keeping annotations with e-value less than 1e-18 and alignment coverage greater than 0.35. To allow for more than a single domain within a gene, any annotation which fulfilled these criteria was kept. Information about each annotation was collected (downloaded from http://csbl.bmb.uga.edu/dbCAN/ download/FamInfo.txt).
Annotation against Pfam 16 version 30.0 was conducted with the script pfam_scan.pl supplied from the ftp://ftp.ebi.ac.uk/pub/databases/Pfam/Tools for version 28.0, using hmmer version 3.1b1 (ref. 26). To allow for more than a single domain within a gene, any annotation which fulfilled these criteria was kept. Information about each annotation was collected as columns 1,2 and 4 from the file pfamA.txt.gz downloaded from ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam30.0/database_files/ on January 11th 2017.

Taxonomic annotation
The method used to assign taxonomy was chosen in order to assign as many contigs as possible to a taxonomy while still keeping false positives to a low level. As the number of sequences in reference databases closely related to the genomes in our samples was expected to be low 8 , amino acid sequences from the assembly were used to compare against other amino acid sequences in the reference database, enabling higher sensitivity (due to the more conserved nature of amino acid sequences). This comparison was done using Diamond version 0.8. 26 (ref. 28) with the parameters "--seg yes","--sensitive" and "--top 10" against the NCBI nr database downloaded December 2nd 2016.
The code used to assign taxonomy from the Diamond search was based on an original available in the DESMAN package 29 and the modified version of the code is available as the script toolbox/scripts/ taxonomy_from_genes_to_contigs/lca_per_contig.py. The assignment was done as follows: all reported hits from the Diamond search were given a weight based on the aligned fraction of the query and the percentage identity of the alignment. At each taxonomic level, if the sum of the weights for one taxon was greater than half the sum of all weights, the gene was assigned to that taxon as long as the percentage identity was high enough. The levels for the percentage identity were set to 40% at superkingdom level, 50% at phylum level, 60% at class level, 70% at order level, 80% at family level, 90% at genus level and 95% at species level.
Taxonomic assignments were set per contig to the most detailed level where consensus for at least 50% of the weights of the preliminary gene assignments could be achieved. Genes without taxonomic annotation were ignored. The shared assignment was propagated to all genes present on that contig. In this way, all genes present on one contig will always share the taxonomic assignment. If no single superkingdom accounted for a majority of the gene assignment weights for a contig, the contig was left unassigned.

Quantification and Normalization
To use the metagenome assembly as a reference assembly, individual samples are functionally and taxonomically annotated by quantifying the different annotations present in the assembly. This is done by mapping all short reads against the assembly and quantifying genes, and thereby any associated annotation, with the number of reads mapping to them. More specifically bowtie2 (ref. 30 version 0.6.1 was used to get raw counts per gene. Counts per annotation was achieved by summing all counts for genes annotated with each respective annotation. When quantifying annotation types where multiple annotations were allowed for a single gene (dbCAN and Pfam), some genes contributed several times to the quantities. This was kept in order to facilitate analysis of differential abundance for the individual annotations.
Along with raw counts of reads for each annotation type and taxonomy, a count normalized by gene length and number of mapped reads was also calculated. Analogously to the formula for Transcripts Per Million used in transcriptomics (ref 33), we calculate TPM for gene counts: Where r g is the number of reads mapped to gene g from the sample, rl is the average read length for the sample, fl g is the length of the gene and G is the set of all genes. T is a convenience variable for the indicated sum over all genes.

Code availability
Code used to preprocess reads, assemble contigs and annotate genes is publicly available at https://github. com/EnvGen/BLUEPRINT_pipeline, containing the pipeline definition of the workflows used, https:// github.com/EnvGen/snakemake-workflows, where the snakemake rules are specified in order to build the  command used for each step, and the branch BARM_publication of https://github.com/EnvGen/toolbox, for custom scripts. Scripts within the latter repository that have been used have been indicated throughout the text.

Data Records
The preprocessed sequencing reads from the Transect and Redoxcline samples were submitted to ENA hosted by EMBL-EBI under the study accession number PRJEB22997 (Data Citation 5). The raw reads

Technical Validation
The mapping rates for all samples included in the reference assembly are shown in Fig. 2, where the majority of samples included in the assembly reaches a level above 80%. This serves as a validation of the completeness of the metagenome assembly. The fraction of reads that did not map to the coassembly, and were hence not assembled past the 200 bases length cutoff most likely originate from low abundance species, or species with high intraspecies diversity generating fragmented assemblies. The mapping rate of the external samples shows the capability for this assembly to serve as a reference metagenome assembly for the Baltic Sea. These external samples 34  Assignment rates for different annotation types, as shown in Fig. 3, are in the majority of cases below 10% of the total number of reads, which is expected since only genes on long contigs (representing 40% of the bases of the total assembly) were predicted and subjected to annotation. The fraction of reads annotated among reads mapping to genes included in the annotation procedure reaches well over 30% for Pfam and shows the generality of that database as compared to i.e. dbCAN, a much more niched resource, which reaches only around 2% of reads mapping to genes included in the annotation. The functional annotation was further validated through an NMDS plot (Fig. 4) based on the EggNOG annotations of the transect data. Depth was found to be negatively correlated with the first dimension (Spearman's rank correlation ρ = − 0.73, P = 5.4 − 06 ) and salinity was negatively correlated with the second dimension (Spearman's rank correlation ρ = − 0.77, P = 2.4 − 06 ). These two environmental parameters have previously been found to correlate strongly with the microbial community in the Baltic Sea 5 which strengthens our trust in the EggNOG annotations. Furthermore, analyzing a single annotation with a known function, namely the photosynthetic reaction centre protein (PF00124), we could see a strong negative correlation with sampling depth over the thirty transect samples (Spearman correlation coefficient ρ = − 0.87, P = 3.1 -10 ).
The taxonomic annotation was validated by inspecting the taxonomic profile of the transect samples. The same dominant prokaryotic taxonomic groups were observed as in previous pan-Baltic amplicon sequencing and metagenomic studies 5,7,10,11 , and the overall trends were conserved with an increase in Alpha-and Gammaproteobacteria and a decrease in Actinobacteria and Betaproteobacteria with increasing salinity levels (Fig. 5).
Among the predicted proteins in BARM, 98% lacked hits with amino acid identities above 95%, hence potentially representing species for which sequenced genomes are lacking 35 . 31% of the sequences lacked significant hits (E-value >1) and potentially correspond to novel protein families.

Usage Notes
A publicly available repository at https://github.com/EnvGen/BARM_tools hosts instructions and a pipeline on how to quantify genes and their annotations within BARM for any kind of Baltic Sea metagenomic and metatranscriptomic samples.
The web interface BalticMicrobeDB, available to the public at http://barm.scilifelab.se, can be used to explore and access data for the three sample sets that the assembly is based upon. At the index page, the user can choose whether to access functional annotations or taxonomic annotations. For the functional annotations, the user can select specific annotation sources and identifiers and select the sample groups for which the counts will be displayed. Furthermore, a text search over the identifiers and the descriptions of the annotations can be used to create a custom table of counts over the selected samples. For taxonomic annotations, counts for the top level superkingdom are first presented but the user can unfold a taxonomic tree to select any taxon to view counts for.