Comparative integrated omics: identification of key functionalities in microbial community-wide metabolic networks

Background: Mixed microbial communities underpin important biotechnological processes such as biological wastewater treatment (BWWT). A detailed knowledge of community structure and function relationships is essential for ultimately driving these systems towards desired outcomes, e.g., the enrichment in organisms capable of accumulating valuable resources during BWWT. Methods: A comparative integrated omic analysis including metagenomics, metatranscriptomics and metaproteomics was carried out to elucidate functional differences between seasonally distinct oleaginous mixed microbial communities (OMMCs) sampled from an anoxic BWWT tank. A computational framework for the reconstruction of community-wide metabolic networks from multi-omic data was developed. These provide an overview of the functional capabilities by incorporating gene copy, transcript and protein abundances. To identify functional genes, which have a disproportionately important role in community function, we define a high relative gene expression and a high betweenness centrality relative to node degree as gene-centric and network topological features, respectively. Results: Genes exhibiting high expression relative to gene copy abundance include genes involved in glycerolipid metabolism, particularly triacylglycerol lipase, encoded by known lipid accumulating populations, e.g., Candidatus Microthrix parvicella. Genes with a high relative gene expression and topologically important positions in the network include genes involved in nitrogen metabolism and fatty acid biosynthesis, encoded by Nitrosomonas spp. and Rhodococcus spp. Such genes may be regarded as ‘keystone genes’ as they are likely to be encoded by keystone species. Conclusion: The linking of key functionalities to community members through integrated omics opens up exciting possibilities for devising prediction and control strategies for microbial communities in the future.

For the quality assessment of the isolated genomic DNA, fractions were separated by electrophoresis on a 1 % agarose gel containing 4 ‰ ethidium bromide (PlusOne Ethidium Bromide, GE Healthcare). For size estimation, the MassRuler DNA ladder mix (Fermentas) 30 was loaded onto the gels. Agarose gels were visualized on an InGenius gel imaging and analysis system (Syngene, Cambridge, UK; as described in Roume et al. 1 ). The DNA pool sample was snap-frozen in liquid nitrogen in the elution buffer and stored at -80 °C until library preparation and sequencing.
RNA quality assessment and quantification was carried out using an Agilent 2100 35 Bioanalyzer (Agilent Technologies, Diegem, Belgium; as described in Roume et al. 1 ).
The quality of protein extracts was assessed following 1D-SDS-PAGE separation.
6 allowing the collection of beads. 45 µl of 10 mM Tris-HCl buffer were removed from the large-fragment wells, leaving all beads in the well. The plate was removed from the magnetic stand and 190 µl of sample from sonication tubes was added to the large fragment wells on the Costar plate. Following 8 min of incubation at room temperature, using two magnetic stands, the large-fragment Costar plate was placed on a magnetic stand to collect beads. 110 Following 5 min incubation at room temperature, 58 µl of bead buffer were removed from the small fragment wells, leaving behind beads and approximately 2 µl of bead buffer. Smallfragment wells were then removed from the magnetic stand. 300 µl of supernatant were transferred from large-fragment wells into small-fragment wells. After transfer, the largefragment plate was discarded. Samples were then incubated in small-fragment wells for 115 5 min and the plate was placed on a magnetic plate to collect beads. 300 µl of supernatant were removed and replaced by 300 µl of 80 % ethanol without disturbing beads. Following 30 s of incubation, ethanol was removed. This last step was repeated two times. Following ethanol removal by air-drying for 5 min, the small fragment plate was removed from the magnetic stand and 45 µl of pre-warmed 10 mM Tris-HCl (pH 8.5) were dispensed into the 120 wells. After re-suspension of beads by vigorous mixing, the solution was incubated for 2 min and placed on a magnetic plate to collect beads. Following a further 3 min incubation, 42.5 µl of supernatant containing the size-selected products were transferred to a new plate.
All enzymes and reaction buffers used for the end-repair, dA-tailing, ligation and PCR amplification were provided from the KAPA Library Preparation Kits with Standard PCR 125 Library Amplification/Illumina series (KapaBiosystems, Woburn, MA, USA).

End-repair
To the supernatant containing the sheared DNA, 10 µl of end-repair buffer and 5 µl of endrepair enzyme mix were added. 100 µl of this solution were added to each well. The plate was then covered, vortexed briefly and spun down before being incubated for 30 min at 130 20 °C in a thermocycler. The samples were then cleaned using Agencourt AMPure SPRI beads (Beckman Coulter). 90 µl of SPRI beads were added into each well, the plate was

A-tailing
To 30 µl of end-repaired DNA, 5 µl of 10x A-tailing buffer, 3 µl of A-tailing enzyme and 140 12.µl of water were added. 50 µl of A-tailing master mix were added to each well. The plate was then covered, vortexed briefly and spun down before being incubated for 30 min at 30 °C in a thermocycler. The sample was then cleaned up using the Agencourt AMPure SPRI bead method with 90 µl of SPRI beads placed in each well. Each sample was then eluted with 36 µl of 10 mM Tris buffer (pH 8.0). The same six samples, randomly chosen for the DNA 145 fragmentation step, were run on an Agilent Bioanalyzer DNA 1000 chip and the average sample concentration was calculated using the "integrated peak" function. 9 Final library quantification by qPCR The KAPA Library Quant kit (KapaBiosystems) for final library quantification was used 170 (Illumina). Briefly, after an initial 1:1 000 dilution in 10 mM Tris-HCl, pH 8.0 + 0.05 % (v/v) Tween 20, the 2x KAPA SYBR FAST qPCR master mix was used to amplify the DNA library with six other standards on an ABI 7900 thermocycler. The qPCR step was conducted using the following cycling conditions: 95 °C for 5 min followed by 35 cycles at 95 °C for 30.s and at 60 °C for 45 s. The concentration of the library was established using the 175 standards according to the manufacturer's instructions. Both the standards and the libraries with unknown concentration were run in triplicate. Prior to sequencing, the libraries were diluted to the required concentration (e.g. 4.5 pM) by following the Illumina cluster generation protocol.

Sequence assembly
For each of the two sampling dates, the raw paired-end 100 nt read metagenomic and metatranscriptomic sequences were processed separately first using the PAired-eND Assembler 4 (PANDAseq, Supplementary Figure 1, step 1) to assemble overlapping read pairs. PANDAseq was run with a score threshold of 0.9 and 25 nt minimum overlap requirement to determine the location of the amplification primers, identify the optimal 185 overlap between reads, correct sequencing errors and check length and base quality. The reads selected by the PANDAseq assembler were extracted from the raw sequence files using in-house Perl scripts. The remaining non-redundant paired-end reads were trimmed using the trim-fastq.pl script from the PoPoolation package 5 using a quality-threshold of 20 (1 % probability of miscall) and a minimum length of 40 nt resulting in two quality trimmed read 190 sets, one including still paired-end and one containing only single-end reads, where the other read pair was discarded during quality trimming. Metagenome and metatranscriptome FASTQ files were then combined into a combined FASTQ file. All PANDAseq and singleend reads were then combined into a single FASTQ file. Paired-end and single-end reads were made non-redundant using CD-HIT-dup 6 (Supplementary Figure 1, step 2).  redundant reads were then used as input for the MOCAT assembly pipeline 7 (Supplementary

205
Extraction and classification of reads mapping to rRNA genes in the metagenomic data EMIRGE 9 was run using metagenomic reads quality filtered to minimum average QV = 30 and a minimum of 40 bp with the trim-fastq.pl script from the PoPoolation package 5 . The reference database used was the truncated (non-redundant) small ribosomal subunit SILVA 10 database release 111. The consensus sequences at 100 iterations were extracted and named based on their identification in the reference database, without imposing any normalized 210 posterior probability.

Generation of additional metagenomic data for contig extension and analysis
To obtain an additional high-depth metagenomic dataset, a total of four additional floating sludge samples islets were collected on 23 rd February 2011, each representing a biological replicate. Biomolecular extraction was performed on 200 mg of starting material as for the 215 other two dates, using the protocol described by Roume et al. 1 . The resulting DNA fractions were sequenced under sample names I, II, III and IV using the sequencing protocol described above. Four technical replicates from sample I were generated, resulting in four libraries I-1, I-2, I-3 and I-4. While the remaining biological replicates II, III and IV were sequenced once each, generating a total of 12.6 gigabases metagenomic sequence. The reads were then 220 assembled using AMOS 11 and MetaVelvet 12 .

Gene annotation
Non-redundant contig files were split into two distinct files, one file with contigs of a length below 500 bp and another file with contigs lengths above or equal to 500 bp. The contigs 225 with lengths below 500 bp were annotated using

Pre-processing of protein fraction for high-throughput analysis
Following 1D-SDS-PAGE electrophoresis and staining with Imperial protein stain (Thermo Scientific, Erembodegem, Belgium; as described in Roume et al. 1

Liquid chromatography
Peptides obtained from the 1 mm-gel bands were separated using an Eksigent Nano 2D LC

Mass spectrometry
Following LC separation, the peptides were analysed on a LTQ-Velos Orbitrap (Thermo-Fisher, San Jose, CA, USA). MS1 data were collected over the range of 300 -2 000 m/z in the Orbitrap at a resolution of 30 000. Fourier-transform mass spectrometry (FTMS) preview 270 scan and predictive automatic gain control (pAGC) were enabled. The full scan FTMS target ion volume was 1 x 10 6 with a maximum fill time of 500 ms. MS2 data were collected in the LTQ-Velos with a target ion volume of 1 x 10 4 and a maximum fill time of 100 ms. The 10 most intense peaks were selected (within a window of 2.0 Da) for higher-energy collisional dissociation at 15 000 resolution in the Orbitrap. Dynamic exclusion was enabled in order to 275 exclude an observed precursor for 180 s after two observations. The dynamic exclusion list size was set at the maximum 500 and the exclusion width was set at ±5 ppm based on precursor mass. Monoisotopic precursor selection and charge state rejection were enabled to reject precursors with z = +1 or unassigned charge state.

Protein identification
For MS analysis, Thermo .RAW files were converted to mzXML format using MSConvert (ProteoWizard 16 ) and searched with X!Tandem 17 version 2011.12.01.1. Spectra were searched against the metagenomic and metatranscriptomic data, common lab protein contaminants, and decoys. Redundancy was removed from these three data sets using added, for a total of 66 entries. Decoys were generated with Mimic (www.kaell.org), which randomly shuffles peptide sequences between tryptic residues, but retains peptide sequence homology in decoy entries. 290 Search criteria used for X!Tandem included a precursor mass tolerance of 15 ppm and a fragment mass tolerance of 15 ppm for higher-energy collisional dissociation spectra.
Peptides were assumed to be semi-tryptic (cleavage after K or R except when followed by P), Accurate mass binning was employed to promote PSMs whose theoretical mass closely matched the observed mass of the precursor ion, and to correct for any systematic mass errors. Decoys and the non-parametric model option were used to improve PSM scoring. 305 Protein identifications were inferred with ProteinProphet. The ProteinProphet scores were then analysed in iProphet 20 , which combines results from multiple fractions and multiple database searches (although here, only X!Tandem was used) and assigns a probability for each unique protein and its corresponding peptide sequences. The false discovery rate for a given iProphet probability was calculated using the number of decoy protein inferences at 310 that probability. Only proteins identified at iProphet probabilities corresponding to a false discovery rate (FDR) less than 1.0 % were further considered.

KO annotations and protein quantification
The corresponding sequences of the identified proteins were collected in FASTA format from the non-redundant single peptides/protein sequences database previously generated from the 315 combined metagenome and metatranscriptome assemblies. Relative protein quantitation was performed using the normalized spectral index (NSI) measure using an in-house software tool called NSICalc as previously described 21 . The amino acid sequences of identified proteins were then mapped to the KO library (KEGG database version 64.0) using BLAT 15 (e-value<10 -5 , %identity>50, score>50). From the resulting list of KOs, the frequency of each 320 KO was determined at the protein level, using an in-house developed Perl script.

Gene copy and transcript abundances
To account for differences in read depth and sampling, the number of raw sequence reads For each library, reads were mapped to genes and counted, except for reads mapping to multiple genes, for which weighted proportions were used. Next, to obtain the abundances of genes and transcripts, read counts were normalized by the length of the respective gene sequences 22 , to obtain normalized gene copy abundances and normalized transcript 335 abundances, respectively. Normalized gene copy abundances per KO were obtained by calculating the sum of normalized gene copy abundances from all genes belonging to the same KO group. Similarly, the KO-wise transcript abundances were calculated as the sum of normalized transcript abundances over all genes within the same KO.

Relative gene expression
Relative expression of KOs was determined by dividing the normalized transcript abundance of each KO by the inferred gene copy abundance of the same KO 23 .
The relative expression of a KO is greatly dependent on the normalized gene copy abundance, if this value is close to zero. Similarly, KOs with normalized gene copy abundances close to zero are prone to be falsely identified as highly expressed due to their 345 very low gene copy abundances. Therefore, highly expressed KOs were selected based on normalized gene copy and transcript abundances, in addition to relative expression. KOs were considered highly expressed, if their relative expression was above the 90 th percentile.
In addition, to avoid false positive identification of highly expressed KOs, the normalized transcript abundances of highly expressed KOs had to be above the 3 rd quartile of the 350 normalized transcript abundances of all KOs or normalized gene copy abundances had to be above an empirically determined threshold. This threshold was determined by sorting KOs by their normalized gene copy abundances and applying a sliding window approach to determine the lowest normalized gene copy abundance with an average relative expression robustly within the interquartile range (see Supplementary Figure 2). 355

Sensitivity of gene expression analysis to imposed cut-offs
Results of the analysis of KOs exhibiting high relative expression are dependent on the quantile of KOs considered highly expressed (we considered KOs with a relative expression above the 90 th percentile), as well as the lower cut-off which was set for gene copy abundances to avoid false positive identification of KOs exhibiting very low gene copy 360 abundances and only slightly higher transcript abundances as highly expressed. Therefore, we first analysed, if the numbers and identities of KOs identified based on their relative expression were robust to different levels of noise added to the data. In addition, we analysed whether the conclusions would also be robust to small to moderate changes in the selected cut-off values. 365 To address the first point (robustness to noise), we changed the gene copy and transcript abundances by a random number following a uniform distribution within different limits. The lowest of these limits corresponded to +/-1 read mapped per kilobase of metagenomic sequence and the highest to +/-50 reads mapped per kilobase. We then analysed the numbers and identities of the highly expressed genes in 100 repetitions of each test, given the chosen 370 cut-offs and the identities of enriched pathways within the selected sets of KOs. We compared the results to the same analysis carried out using a single numerical cut-off for relative expression (minimal relative transcript abundance = 10 times relative gene copy abundance).
To address the second point (robustness to changes in cut-offs), we changed each cut-off 375 within small to moderate limits. For the inclusion of KOs irrespective of their gene copy abundances, cut-offs at steps between the 55 th to 95 th percentile of transcript abundances were used. For the exclusion of KOs with low gene copy abundances, different cut-offs of robust relative expression were set between the 20 th and 95 th percentile. We then analysed the numbers and identities of the KOs found to be highly expressed, as well as the pathways 380 enriched in these KOs.

Comparison of the metagenomic dataset with the metatranscriptomic and metaproteomic datasets
The congruency of metagenomic and metatranscriptomic datasets was determined by calculating the proportion of KOs with at least one gene having at least 10 metagenomic reads per kilobase mapping to it and also at least one gene resulting in the mapping of 10 385 metatranscriptomic reads per kilobase. The congruency of the metagenomic and metaproteomic datasets was calculated analogously as the proportion of KOs with at least one gene mapped by at least 10 metagenomic reads per kilobase that also had at least one gene identified at the protein level.

Analysis of pathway membership
Assignment of KOs to KEGG pathways was done by using the KO to pathway link (http://rest.kegg.jp/link/Ko/pathway) in the KEGG database version 67.1. Enrichment of KOs 19 in specific pathways was tested using a hypergeometric test and p-values were adjusted using FDR-control 24 . Test results with adjusted p-values below 0.05 were considered significant.

Community-wide metabolic network reconstructions
The KO to R (reaction) link (http://rest.kegg.jp/link/Ko/reaction) was used to associate each individual KO to the corresponding reactions following the R to RP (reaction pair) link

Calculation of gene copy and transcript number and relative expression of regrouped nodes
Gene copy numbers and transcript numbers of nodes which represented several KOs due to their sharing of the same edges were calculated by summing all normalized gene copy and transcript numbers, respectively. Relative expression of a node was calculated by dividing the 420 node-wise sum of normalized transcript abundances by the node-wise sum of normalized gene copy abundances, thereby levelling relative expression of the regrouped KOs.

Weighted load score
Weighting of compounds in a bi-partite RPAIR-based metabolic network reconstruction has been shown to increase pathfinding accuracy 27 . The number of occurrence described above 435 was assigned as edge weight within the metabolic network reconstructions. A weighted betweenness centrality was calculated using the R-package igraph 28 . Alternative weighted load scores were calculated for each node from the weighted betweenness centralities and the degree as defined in the manuscript, and potential key functionalities were determined using this measure and expression as described in the main manuscript. The resulting weighted 440 load scores and the identities of the key nodes were compared to the unweighted results discussed in the manuscript.

Alignment of contigs encoding key functionalities
Contigs were selected based on the following criteria: (1) they encoded a gene annotated with a KO representing a key functionality, and (2) the expression of this gene was corroborated by at least one mapped metatranscriptomic read. Selected contigs were aligned to the NCBI non-redundant nucleotide database using BLASTn with default parameters 29 . The best hit with 450 a query coverage above 50 % and a percentage identity above 80 % for each contig was documented. In addition, contigs containing genes annotated as K03921 were aligned to 85 isolate genomes from the same biological wastewater treatment (BWWT) plant using BLAST and selecting only results with percentage identity above 80 %. additional metagenomic data for contig extension and analysis) from a different sample, using the AMO contigs from the previous step as a reference. This procedure was performed, because the genes encoding the subunits of AMO are known to exist in a cluster/operon 34 .
Following the run on minimus2, gene calling was performed on the resultant contigs using FragGeneScan 13 and Prodigal 14 using default parameters as previously used. Predicted 480 23 amino acid sequences were then merged and made non-redundant based on 100 % sequence identity using CD-HIT 35 and were re-annotated with KO IDs using our annotation pipeline.
Gene calling and re-annotation steps were performed to ensure that the extended contigs retained their original annotation reference. The extended contigs were then used for downstream analysis in order to associate these AMO genes to a bacterial species. 485

AmoA phylogenetic analysis
Nearly complete amino acid sequences (201 -274 amino acids) of AmoA and/or MmoA from representative organisms belonging to the beta-Proteobacteria, gamma-Proteobacteria and archaea were retrieved from the Refseq protein database (see Supplementary Table 3) and aligned with ClustalOmega (using default parameters). The alignment file was submitted 490 to a phylogenetic analysis using the Phylogeny.fr customized workflow service 36 including alignment curation with Gblocks 37 (using default parameters), tree construction with PhyML 38 (bootstrap of 100), and visualization by TreeDyn 39 .

Isolate LCSB065 isolation
Isolate LCSB065 was obtained from an OMMC biomass sample diluted by a factor of 10 4 . 495 The biomass was first cultivated on Petri dishes of wastewater-agar medium (1.5 % agar; w/v) in 800 ml filtered (0.2 µm, Sartorius, Göttingen, Germany) wastewater from the Schifflange BWWT plant. A single colony was then transferred to a Petri dish with R2A medium 40 and cultivated at 20 °C under aerobic conditions. Isolates were grown on different growth media recommended for the culture of bacteria from water and wastewater, 500 particularly Microthrix parvicella, such as R2A 40  and ImageJ 44 .

Isolate genome sequencing and genome assembly
Following DNA extraction from isolate cultures using the Power Soil DNA isolation kit (MO BIO, Carlsbad, CA), a paired-end sequencing library with a theoretical insert size of 300 bp 520 was prepared with the AMPure XP/Size Select Buffer Protocol as previously described by The resulting contigs were uploaded to and analysed using RAST 49 . According to this analysis, the isolate was a Rhodococcus sp. Similarity of contigs to bacterial genomes (NCBI 535 database, accessed 6 th March 2014) were assessed by BLAT 15 . The bitscore of each hit was recorded and only contigs with a hit to Rhodococcus spp. with a bitscore within 60 % of the best hit's bitscore were selected for the final set. This led to the removal of 120 contigs most of which had low coverage of reads.
Filtered reads were mapped onto the assembled contigs using BWA 50 with default parameters. 540 Reads mapping with mapping quality scores at or above five were used to assess contig coverage (Supplementary Dataset 8). The final set of contigs was submitted to AmphoraNet server to determine the isolate species searching for 31 phylogenetic marker genes 51 , as well as to RAST 49 (accession number 6666666.64457) and annotated.
The COG protein profile of Isolate LCSB065 was determined as previously described by 545 Muller et al. 21 . Annotation of KOs was carried out on protein predictions from RAST as described for metagenomic proteins.

Sensitivity of gene expression analysis to imposed cut-offs
We found that the cut-offs imposed to avoid false-positives within the highly expressed genes 550 and genes encoding key functionalities led to a greater robustness against noise than would be observed if simple numerical cut-offs had been chosen. This was obvious from the simulation of noise, in which the number and variance of highly expressed genes grew with the noise, when a numerical cut-off was chosen, whereas the choice of cut-offs, as defined herein, resulted in mostly stable numbers (Supplementary Figure 3a&b). 555 The identities and numbers of KOs identified as exhibiting high relative expression in our datasets were not overly sensitive to the cut-offs imposed to avoid false-positive results. The numbers of highly expressed KOs decreased by less than 20 % by the exclusion of KOs with low gene copy abundances (Supplementary Figure 3c). As most genes with very high transcript abundances did not have low gene copy numbers (Figure 3), the cut-off selected 560 for inclusion of KOs with high transcript abundances irrespective of their gene copy numbers changed the total number of highly expressed KOs by less than 5 %. If the transcript abundance cut-off for genes with low gene copy abundances was set between the 55 th and 95 th percentile (our default value was the 75 th percentile), an enrichment with the KOs above the 90 th percentile of the highly expressed KOs of 4 or 5 out of the 5 pathways in autumn, and 565 5 out of the 6 pathways in winter was consistently detected (Supplementary Figure 3d). Figure 3e). In particular, the findings of pathways ko00910 "Nitrogen metabolism" and ko00190 "Oxidative phosphorylation" as exhibiting overall high levels of gene expression were resilient to changes in cut-offs. 570 28 In conclusion, the described gene expression analysis is robust to noise, as well as too small to moderate changes in the chosen cut-offs.

Effect of regrouping redundant KOs into single nodes
To carry out a topological analysis of the reconstructed metabolic network, nodes and edges were rendered non-redundant, by representing multiple KOs with identical substrate and 575 product metabolites as a single node. Due to this step, 229 and 220 nodes representing more than one KO were part of the autumn and winter metabolic network reconstructions, respectively. The calculated load scores were overall only mildly affected by the regrouping

Genomic analysis of Isolate LCSB065
Mapping of filtered reads onto Isolate LCSB065's contigs revealed a mean/standard 590 deviation empirical sequencing insert size of 244±43 bp, and a mean read depth per mapped position (coverage) of 27±11x (median 25x).

29
As a first approach to analyse Isolate LCSB065's genetic potential, protein coding genes were annotated with KOs as before for the metagenomic and metatranscriptomic sequences.
Of utmost interest, out of 3 373 protein coding genes that could be annotated with KOs, 420 595 genes (12.5 %) were annotated as KOs belonging to "Lipid metabolism", according to KEGG Orthology. This relates to rank 4 behind "Amino acid metabolism" (19.

List of Supplementary Figures
Supplementary Figure 1 Overview of the assembly and annotation pipeline.

Supplementary Figure 2
Determination of lower thresholds for gene abundances for the selection of highly expressed KOs within the reconstructed community-wide metabolic networks.

Supplementary Figure 3
Results of the sensitivity analyses.

Supplementary Figure 4
Expression of KOs in metabolic pathways at the protein level.

List of Supplementary Tables
Supplementary Table 1 Physicochemical characteristics of the wastewater at the time of sampling in the anoxic tank of the Schifflange BWWT plant.

Supplementary Table 2
Quantitative and qualitative characteristics of biomacromolecular fractions sequentially isolated from the OMMC samples.

Supplementary Table 3
Statistics of the combined assembly of the metagenomic and metatranscriptomic sequence datasets.

Supplementary Table 4
Ammonia-oxidizing organisms with corresponding accession numbers of amoA genes used for the reconstruction of the phylogenetic tree.

Supplementary Figures
Supplementary Figure 1 Overview of the assembly and annotation pipeline. Annotated KOs were used for the subsequent metabolic network reconstructions. QC: quality control, 1-7: steps in the pipeline.