The environmental metabolome and metabolic potential of microorganisms are dominant and essential factors shaping microbial community composition. Recent advances in genome annotation and systems biology now allow us to semiautomatically reconstruct genome-scale metabolic models (GSMMs) of microorganisms based on their genome sequence1. Next, growth of these models in a defined metabolic environment can be predicted in silico, mechanistically linking the metabolic fluxes of individual microbial populations to the community dynamics. A major advantage of GSMMs is that no training data is needed, besides information about the metabolic capacity of individual genes (genome annotation) and knowledge of the available environmental metabolites that allow the microorganism to grow. However, the composition of the environment is often not fully determined and remains difficult to measure2. We hypothesized that the relative abundance of different bacterial species, as measured by metagenomics, can be combined with GSMMs of individual bacteria to reveal the metabolic status of a given biome. Using a newly developed algorithm involving over 1,500 GSMMs of human-associated bacteria, we inferred distinct metabolomes for four human body sites that are consistent with experimental data. Together, we link the metagenome to the metabolome in a mechanistic framework towards predictive microbiome modelling.
Microbial communities constantly adapt to exploit available resources3. As a result, the presence of specific microorganisms or distributions of microorganisms allow us to infer environmental features. For example, the altered metabolic conditions in the microenvironment of colorectal cancer tumours select for the outgrowth of specific species in the human colorectal cancer microbiome4, allowing cancer detection5. Similarly, microorganisms can serve as biosensors for geochemical features such as solvent or uranium contamination6. These and many other empirical examples of significant associations between the environment and the microbiota7,8 suggest that the composition and metabolic potential of microbial communities can be used to reconstruct the metabolic environment of a biome through a reverse engineering strategy.
Shotgun metagenomics rapidly inventories the composition and genomic content of microbiomes, but obtaining comparably high-resolution metabolomic measurements of microbial environments and linking them to microorganisms remains challenging2. It is often even more challenging to translate the presence/absence of genes or the taxonomic composition of a given environment into predictions about the metabolic status of the community9. Although several tools exist that allow researchers to translate shotgun metagenomes into functions or functional profiles of a given environment10,11,12,13, current methods often require extensive training datasets and provide knowledge at the community-level, without mechanistically attributing differences to specific microorganisms or metabolites.
Capitalizing on advances in the functional annotation of genes, GSMMs attempt to mechanistically explain bacterial growth and metabolism in a defined environment without requiring training data1. GSMMs provide a minimally biased description of the metabolic processes that are encoded in a microbial genome by integrating database knowledge about protein functions into a reaction network that describes the metabolic potential of a genome. Constraint-based approaches such as flux balance analysis (FBA) allow the growth or biomass production of GSMMs to be formulated as an optimization problem and thereby estimated14. However, integrating multiple GSMMs into a microbial community model, and producing meaningful predictions about the metabolic status of the environment, is still an open research challenge9. Here, we addressed this challenge by developing an optimization framework to predict the metabolic environment that best explains the observed species abundance distribution profiles. The algorithm takes metagenomic species abundances and their GSMMs as input, and does not require any further training data.
The main premise of our approach, named MAMBO for Metabolomic Analysis of Metagenomes using fBa and Optimization, is that the abundance distribution of microbial genomes and their encoded metabolic potential reveal how microorganisms can exploit the metabolic resources that are available in the environment. We qualitatively infer these resources by searching for the metabolic environment that yields microbial growth that is best correlated with the relative abundances observed in the metagenome profiles. This computational approach is outlined in Fig. 1. First, we used reference genome sequences to generate GSMMs of the species encountered in a given metagenomic dataset. Also from the metagenomics, we extracted the relative abundances of these organisms in the sample. We can now ask the question, which metabolic environment would lead to relative growth rates of the GSMMs that best correlate with the observed relative abundances of the organisms? In this context, growth is defined as the fluxes in the biomass reactions of the GSMMs. We constrain the GSMMs of all the microorganisms found in a sample by providing them with the same metabolic environment (modelled as a limit to the import reactions of metabolites) and assuming the same cellular objective of growth. Finally, we use a semi-Markov chain to sample the highly dimensional metabolome space, optimizing for the metabolomic composition that leads to an optimal correlation of the GSMMs growth profile with the microbiome metagenomic abundance profile. Thus, an optimization run predicts the relative metabolite abundances in an environment, as shown for a typical run in Fig. 2. Importantly, our approach mechanistically links the environmental metabolome to the metabolic fluxes in genome-scale reconstructions of the metabolism of each individual microorganism, because the optimization is performed by FBA solutions on a per-genome basis.
We tested our approach by inferring the metabolomic environment in four human body sites (oral, skin, stool and vaginal) using 175 metagenomic datasets15 and GSMMs that we reconstructed for 1,562 detected bacteria16. The inferred metabolomes revealed four clusters corresponding to the four body site biomes (Fig. 3), as were previously observed for the body site-specific microbiomes15. This clustering was independent of the initialization of our algorithm in the high-dimensional metabolome search space. For example, searches based on oral metagenomes that were initiated with predicted skin metabolomes quickly converged to the oral metabolome cluster and the same pattern was observed for the other body sites (Supplementary Fig. 1a–d). Repeated metabolomes inferred from the same metagenome had an average Pearson correlation of 0.96 ± 0.02, showing high robustness and consistency of the algorithm.
To benchmark our algorithm on experimental data, we identified six annotated, quantified, high-throughput metabolomes from saliva, faeces and vagina17,18,19,20,21,22,23 and correlated the metabolites measured in these studies with our predictions from 175 Human Microbiome Project (HMP) metagenomes. Figure 4 shows that the predicted and measured metabolomes for these body site biomes are consistent: metabolomes inferred from oral, stool and vaginal metagenomes correlated significantly better with those measured in saliva, faecal water and faecal incubator, and vagina, respectively, than with metabolomes from other body site biomes (P = 3.5 × 10^−52, one-tailed unpaired t-test).
A recent screen of metabolites on human skin revealed the influence of skin care and hygiene products on the microbiome24. Using MAMBO, we predicted metabolomes from 50 skin metagenomes, confirming the abundance of various cosmetic and hygiene ingredients (Supplementary Table 1). Moreover, we assessed the abundance of skin metabolites across twelve samples for which both microbiomes and metabolomes had previously been measured24, including triplicates from the hand and foot of a male and a female volunteer. In these samples, unlike in the metabolomic studies above, metabolites were measured using an untargeted approach, which only allowed for a metabolite-by-metabolite comparison across samples. For the majority of metabolites the measured and predicted abundances correlated positively across these four skin sites (P = 0.0064, one-tailed binomial test; Supplementary Fig. 2), showing that most metabolites can be distinguished between samples from the same biome.
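As an illustration of the one-tailed binomial test mentioned above, the following minimal sketch computes the probability of observing at least a given number of positively correlated metabolites if positive and negative correlations were equally likely. The counts are hypothetical and chosen only for illustration; they are not the numbers behind the reported P value.

```python
from math import comb

def binomial_upper_tail(successes, trials, p=0.5):
    """Return P(X >= successes) for X ~ Binomial(trials, p)."""
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical counts (assumption, not taken from the study):
# 15 of 20 metabolites show a positive measured-vs-predicted correlation.
p_value = binomial_upper_tail(15, 20)
print(p_value)
```

Under the null hypothesis each metabolite has a 50% chance of correlating positively, so a small upper-tail probability indicates more positive correlations than expected by chance.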
We evaluated whether gene composition alone is sufficient to infer the environmental metabolome without the need to reconstruct GSMMs and use optimization. For this purpose, we adapted an existing algorithm, the Predicted Relative Metabolic Turnover (PRMT) score10,25, which predicts the metabolic turnover in one sample relative to another. We applied it to the same reference mapping of 175 HMP samples as above15,16 and evaluated how well this gene-based analysis predicted the relative abundances of metabolites. As shown in Supplementary Fig. 3, the resulting metabolite lists showed a lower correlation with the measured metabolomes than the GSMM-based MAMBO analysis. On closer inspection, this limited correlation of the gene-based predictions results from some metabolites being spuriously predicted at high abundance. For example, malate is consistently predicted to be abundant in stool because several stool bacteria contain multiple malate dehydrogenase genes. However, the malate concentration is low in faecal water20 and faecal incubator17 metabolomes, but high in the vaginal metabolome18. In contrast, by exploiting GSMMs rather than individual genes, MAMBO predicted that a high concentration of malate was not important to achieve the observed abundance profile of bacteria in most stool samples, while it was important for most vaginal samples, consistent with the experimentally measured metabolomes.
It should be noted that PRMT predicts the relative bacterial consumption or production of metabolites compared to an average, and is thus not directly aimed at predicting the net production or consumption of metabolites10,25. The main differences between the gene-based predictions and MAMBO are (1) PRMT does not offer an assessment of the metabolome on a sample-by-sample basis, but rather a comparison of a sample to an average metabolome in order to highlight the largest differences; (2) PRMT assumes that the relative abundance of a metabolite is proportional to the relative number of genes coding for an enzyme, an assumption we avoid by using FBA where enzyme fluxes are defined by optimizing for growth; and (3) the PRMT connectivity matrix is not compartmentalized by species, so PRMT does not require GSMMs.
The MAMBO metabolome prediction framework takes an important step towards predictive microbiome modelling. GSMMs have previously been applied to microbial communities26,27 but those GSMMs depend on an explicit definition of the environment including the metabolites and their relative abundances, to allow the modelled microorganisms to grow. However, the environmental metabolome remains unclear for many measured microbiomes, either because the metabolomic experiments are lacking, or because they were measured in a slightly different system or at a different scale than observed by the microorganisms in the metagenome. Here, we bridge this gap by predicting the environmental metabolome directly from the metagenome. There is still some noise in these predictions; for example, several predicted metabolomes show high correlations to measured metabolomes from other biomes (Fig. 4). First, there are many factors that influence the microbial abundances besides the available metabolites, including bacteriophages28 and human factors29, to name a few. Taking these and other factors into account may improve the performance of a tool to infer the environmental metabolome, and further contribute to predictive microbiome modelling. From a technical perspective, the experimentally measured metabolomes used in our comparisons are derived from biofluids with a complex composition, and it remains challenging to link the mass-over-charge (m/z) peaks in mass spectrometry spectra to specific metabolites30. For example, we could only identify the metabolites in untargeted metabolomics experiments24 by mapping the m/z values to published standards (see Methods). Structurally different metabolites often have identical chemical composition and m/z values, so identification of metabolites based on m/z alone is inherently inaccurate. 
Finally, the metabolomic and metagenomic datasets that we exploited were measured and published independently in samples that may differ in unknown ways; for example, a faecal incubator versus fresh stool. Nevertheless, we could predict metabolomes for different body site biomes that significantly correlate with the experimental data (P = 3.5 × 10^−52, one-tailed unpaired t-test), and find positive correlations for most metabolites across paired samples from the same body site.
Without requiring training data, MAMBO implicitly exploits the fact that microorganisms in an ecosystem constantly compete for resources, leading to a relative abundance distribution that reflects their ability to exploit these resources. Metagenome-guided modelling enables a deeper understanding of microbiomes by linking the environmental metabolome to the metabolic network of individual microbial populations. By explicitly modelling the fluxes of individual GSMMs that are matched with the species composition of the system in a probabilistic fashion, our approach provides a starting point for mechanistic models of microbial ecology, including the potential for systems with more complex cross-feeding networks9.
From the US Department of Energy Systems Biology Knowledgebase (KBase, http://www.kbase.us) and the HMP (http://www.hmpdacc.org) we downloaded the human microbiome reference genomes15, as well as taxon-abundance profiles for 175 metagenomes, respectively, including 37 oral, 50 skin, 39 stool and 49 vaginal metagenomes (listed in Supplementary Table 2). The abundance profiles were previously generated according to the HMP standard operating procedure16 (http://www.hmpdacc.org/doc/ReadMapping_SOP.pdf), where 57.6% of the sequenced reads were aligned to the reference database across all HMP metagenomes (see section 4.5 of the SOP document). Additionally, we included 372 GSMMs from a recent study26 of genomes that were not in the HMP list. Metabolites were matched to the SEED database using the conversion table provided by the authors, and the genome sequences were obtained from the NCBI nucleotide database.
Experimentally measured metabolomic profiles were obtained from the Human Metabolomics Database19, including one from faecal water17 and three from saliva21,22,23. One faecal incubator17 and one vaginal18 metabolome were obtained from recent literature.
We used data from a recent study of the metabolites on human skin24 to compare predicted and measured metabolites across four different skin sites where both the microbiome and metabolome were measured. We obtained 16S amplicon datasets and raw capillary gas chromatography mass spectrometry spectra for 12 skin samples, including triplicates from two body sites of two individuals24. We used the Burrows-Wheeler Aligner31 to map the 16S reads to the genomes in our database (average 74.5% of reads mapped). We created a database containing all 214 metabolites exported by the GSMMs for which the retention time and mass-to-charge ratio (m/z) were annotated in the Human Metabolomics Database19. We then used these parameters to annotate the capillary gas chromatography mass spectrometry peaks from the skin study. We used MZ-Mine32 to identify features and align and deconvolute the raw spectra, generating a normalized peak list consisting of retention time versus m/z. Thus, 21 peaks could be unambiguously mapped to GSMM metabolites across all twelve samples. Next, we compared metabolite abundances across, rather than within, samples, since the raw spectra of different metabolites in an untargeted metabolomics study are not comparable within a sample33. Thus, the area under the peaks was used as an indicator of metabolite abundance across samples, and used in our analysis (Supplementary Fig. 2).
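The peak annotation step can be sketched as a lookup of each peak's retention time and m/z against a reference table, keeping only peaks that match exactly one reference metabolite. The tolerances, metabolite names and numeric values below are hypothetical placeholders; the matching criteria actually used are not stated in the text above.

```python
def match_peaks(peaks, references, rt_tol=0.1, mz_tol=0.01):
    """Map each peak to a reference metabolite if exactly one reference lies
    within both tolerances; ambiguous or unmatched peaks are dropped.

    peaks:      {peak_id: (retention_time, mz, peak_area)}
    references: {metabolite_name: (retention_time, mz)}
    rt_tol and mz_tol are assumed tolerances, chosen for illustration only.
    """
    matches = {}
    for peak_id, (rt, mz, area) in peaks.items():
        hits = [
            name
            for name, (ref_rt, ref_mz) in references.items()
            if abs(rt - ref_rt) <= rt_tol and abs(mz - ref_mz) <= mz_tol
        ]
        if len(hits) == 1:  # keep unambiguous assignments only
            matches[peak_id] = (hits[0], area)
    return matches

# Hypothetical reference table and peak list.
references = {
    "metabolite_A": (5.20, 147.05),
    "metabolite_B": (7.80, 204.10),
    "metabolite_C": (7.85, 204.10),
}
peaks = {
    "peak_1": (5.22, 147.05, 1200.0),  # matches metabolite_A only
    "peak_2": (7.82, 204.10, 300.0),   # matches B and C, so it is ambiguous
    "peak_3": (12.00, 90.00, 50.0),    # matches nothing
}
annotated = match_peaks(peaks, references)
print(annotated)
```

The retained peak areas can then be compared for one metabolite across samples, in line with the across-sample comparison described above.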
We used the ModelSEED pipeline1 to generate GSMMs for the 1,562 HMP reference genomes that were present in at least one of the metagenomic datasets (available at https://github.com/danielriosgarza/MAMBO). Briefly, genomic annotations were used to identify the biochemical reactions in a species’ metabolic network. The molecular stoichiometry of these reactions was expressed in a matrix that transforms reaction rates to the time-derivative of metabolite concentrations. The nullspace of this matrix contains the steady-state (equilibrium) solutions for the reaction rates. Parsimonious gap filling was applied by adding the minimal possible set of reactions to the model that are essential for a model to grow; that is, to yield a flux through the biomass reactions1. Gap-filled reactions were probably missed during sequencing, assembly or genome annotation. We excluded dead-end exchange reactions from the models that remained unresolved after gap filling or had no influence on the objective function. FBA simulations were performed in a Python 2.7 environment, using the COBRApy package for constraint-based modelling34 and Gurobi 5.6.3 (http://www.gurobi.com) or GLPK 4.35 (http://www.gnu.org/software/glpk) as linear programming solvers. To reflect the constant competition between microorganisms, we used growth as the objective function in the FBA14.
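The mass-balance constraint behind FBA can be illustrated with a minimal sketch. It uses a hypothetical three-reaction toy network in plain Python, not a ModelSEED model or the COBRApy interface, and the flux bounds are arbitrary illustrative values.

```python
# Toy network (hypothetical): uptake: -> A, convert: A -> B, biomass: B ->
# Rows are metabolites (A, B); columns are reactions (uptake, convert, biomass).
S = [
    [1, -1, 0],   # metabolite A: produced by uptake, consumed by convert
    [0, 1, -1],   # metabolite B: produced by convert, consumed by biomass
]

def net_production(S, v):
    """Return S . v, the rate of change of each metabolite concentration."""
    return [sum(s * f for s, f in zip(row, v)) for row in S]

def is_steady_state(S, v, tol=1e-9):
    """True if the flux vector v lies in the nullspace of S (within tol)."""
    return all(abs(x) <= tol for x in net_production(S, v))

# Upper bounds on each flux; the uptake bound stands in for the environment.
upper_bounds = [10.0, 1000.0, 1000.0]

# In this linear pathway the steady state forces all three fluxes to be equal,
# so the maximal biomass flux is simply the smallest upper bound. A general
# network requires a linear programming solver instead of this shortcut.
max_growth = min(upper_bounds)
v_opt = [max_growth] * 3

print(is_steady_state(S, v_opt), max_growth)
```

In this toy case, lowering the uptake bound lowers the maximal biomass flux by the same amount, which is how the environment limits growth in the sketch.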
Metabolomic Analysis of Metagenomes using fBa and Optimization algorithm
For the MAMBO algorithm, explained in the main text and depicted in Fig. 1, we constrained the GSMMs of all the microorganisms found in a metagenome by providing the same metabolic environment (modelled as an upper bound to the import reactions of metabolites) and assuming the same cellular objective of growth. We used semi-Markov chain sampling embedded in a Metropolis-Hastings algorithm35 to identify the metabolomic composition that optimally correlates with the abundance profile of the microbial genomes observed in the metagenome.
The input of the algorithm consists of (1) a list of microorganisms and their relative abundances and (2) a database of GSMMs generated from the genomes of these microorganisms. Thus, the approach depends on the availability of high-quality draft reference genome sequences, as are available for the microorganisms found in the human microbiome and increasingly also for other environments. Typically, one GSMM will have 35–80 exchange reactions representing the metabolic compounds that the organism can utilize. Depending on the complexity of the microbiome, the GSMMs of all the microorganisms in a community together will be able to utilize >200 different metabolites. These combined exchange reactions represent the metabolites whose environmental concentrations are inferred by MAMBO.
At the core of the approach is an optimization algorithm that searches the >200-dimensional metabolome search space for a composition of the metabolomic environment that, when applied simultaneously to the GSMMs of all coexisting microorganisms using FBA, yields a relative biomass production profile b that correlates with the abundance profile m of the microorganisms in the metagenome. The metabolic compound concentrations are modelled in the FBA as an upper bound to the influx reaction. We use Monte Carlo optimization following a semi-Markov chain to search the highly dimensional solution space. After random initialization (or initialization with a decoy metabolome as in Supplementary Fig. 1a–d), a new candidate environment e′ is generated from the current environment e by slightly altering the concentration of one metabolite following a uniform distribution. The maximum biomass production rates of all GSMMs are then evaluated for the candidate environment, and the change is accepted if the Pearson correlation of the metagenomic abundances with the growth rates in the candidate environment, ρ(m, b_e′), is higher than for the current environment, ρ(m, b_e), or otherwise with probability ρ(m, b_e′)/ρ(m, b_e). Every 150 search steps, the algorithm evaluates the past outcomes and chooses the environment that yielded the highest correlation36. Samples were first subjected to 100,000 search steps, and 100,000 steps were subsequently added until a high Pearson correlation (ρ > 0.6) with the target metagenomic abundance profile was achieved. Finally, the 10% of time points with the highest Pearson correlation scores between the biomass profile and the metagenomic abundance profile were averaged, yielding a robust predicted metabolome (Fig. 2).
Note that the correlation that is optimized using the semi-Markov chain is the correlation ρ(m, b_e) between the metagenomic species profile m and the biomass production rates b_e, while the correlations that are shown in our results (for example, in Fig. 4 and Supplementary Table 3) are correlations between the predicted metabolome e and experimentally measured metabolic concentrations.
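The search procedure described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the published Cython/Python implementation. The per-organism FBA is replaced by a hypothetical linear growth function (`toy_growth`), and the step size, the number of steps and the handling of non-positive correlations in the acceptance ratio are illustrative choices.

```python
import random

def pearson(x, y):
    """Pearson correlation; returns 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def toy_growth(env, yields):
    """Stand-in for FBA (assumption): growth is linear in metabolite levels."""
    return [sum(y * e for y, e in zip(row, env)) for row in yields]

def search(abundances, yields, steps=3000, step_size=0.1, reset_every=150, seed=0):
    rng = random.Random(seed)
    n_met = len(yields[0])
    env = [rng.random() for _ in range(n_met)]       # random initialization
    score = pearson(abundances, toy_growth(env, yields))
    best_env, best_score = list(env), score
    history = [(score, list(env))]
    for step in range(1, steps + 1):
        # Candidate e': perturb one metabolite by a uniformly drawn amount.
        cand = list(env)
        j = rng.randrange(n_met)
        cand[j] = max(0.0, cand[j] + rng.uniform(-step_size, step_size))
        cand_score = pearson(abundances, toy_growth(cand, yields))
        # Accept improvements; otherwise accept with probability rho'/rho.
        # Treating a non-positive current rho as probability 0 is an assumption.
        accept_prob = cand_score / score if score > 0 else 0.0
        if cand_score > score or rng.random() < accept_prob:
            env, score = cand, cand_score
        if score > best_score:
            best_env, best_score = list(env), score
        history.append((score, list(env)))
        if step % reset_every == 0:                  # return to best so far
            env, score = list(best_env), best_score
    # Average the 10% of visited environments with the highest correlation.
    history.sort(key=lambda item: item[0], reverse=True)
    top = history[: max(1, len(history) // 10)]
    mean_env = [sum(e[k] for _, e in top) / len(top) for k in range(n_met)]
    return mean_env, best_score

# Hypothetical inputs: three organisms, three metabolites.
abundances = [0.6, 0.3, 0.1]
yields = [[1.0, 0.2, 0.0], [0.1, 1.0, 0.3], [0.0, 0.2, 1.0]]
predicted_env, best_rho = search(abundances, yields)
print(predicted_env, best_rho)
```

In the sketch, as in the description above, the optimized quantity is the correlation between abundances and growth rates, and the averaged environment is the output of interest.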
Comparison with individual gene-based metabolome prediction
For gene-based comparison we generated PRMT10,25 scores for the same 175 HMP samples that were used to benchmark the MAMBO algorithm. For this purpose, we first derived a matrix containing the number of genes per genome coding for a given enzyme reaction. This matrix was used to transform the vector of relative bacterial abundances per environment into a vector of normalized enzyme counts, which expresses the relative importance of enzymes given a metagenomic species abundance profile. Second, we built a table mapping all the enzymes found in the previous step to metabolites, which was expressed as a large connectivity matrix with metabolites as rows and enzymes as columns, and was normalized by rows. This matrix was used to transform the vector of normalized enzyme counts into a vector of predicted scores per metabolites. The predicted scores were quantile-normalized and compared to the average scores across all samples to produce the sample-by-sample PRMT scores, which are expressed as fold changes of metabolite importance in a given sample relative to the average importance across all samples.
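The two matrix transformations and the final comparison to the across-sample average can be sketched as below. This is an illustrative simplification under stated assumptions: the matrices and abundances are hypothetical toy values, the quantile normalization step is omitted, and the connectivity matrix is unsigned.

```python
def enzyme_counts(abundances, gene_counts):
    """Abundance-weighted enzyme gene counts, normalized to sum to 1.
    gene_counts[g][k] = number of genes in genome g coding for enzyme k."""
    n_enz = len(gene_counts[0])
    raw = [
        sum(a * row[k] for a, row in zip(abundances, gene_counts))
        for k in range(n_enz)
    ]
    total = sum(raw)
    return [r / total for r in raw]

def row_normalize(matrix):
    """Scale each row (one metabolite) so that its entries sum to 1."""
    return [[x / sum(row) for x in row] for row in matrix]

def metabolite_scores(counts, connectivity):
    """Multiply the metabolite-by-enzyme matrix with the enzyme count vector."""
    return [sum(m * c for m, c in zip(row, counts)) for row in connectivity]

def fold_changes(sample_scores):
    """Score of each metabolite in each sample relative to the sample mean."""
    n = len(sample_scores)
    n_met = len(sample_scores[0])
    mean = [sum(s[j] for s in sample_scores) / n for j in range(n_met)]
    return [[s[j] / mean[j] for j in range(n_met)] for s in sample_scores]

# Hypothetical toy data: two genomes, two enzymes, two metabolites, two samples.
gene_counts = [[2, 0], [0, 1]]                    # genomes x enzymes
connectivity = row_normalize([[1, 0], [1, 1]])    # metabolites x enzymes
samples = [[0.5, 0.5], [0.2, 0.8]]                # relative abundances
scores = [
    metabolite_scores(enzyme_counts(a, gene_counts), connectivity)
    for a in samples
]
prmt_like = fold_changes(scores)
print(prmt_like)
```

Because each score is driven directly by gene counts, a genome carrying several copies of one enzyme gene raises the score of the linked metabolites.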
Statistical analyses were performed in a Python 2.7 environment, using the “stats” statistical package of SciPy 0.15.1. Principal coordinate analyses were performed using the scikit-learn package.
Life Sciences Reporting Summary
Further information on experimental design is available in the Life Sciences Reporting Summary.
The Cython/Python implementation of MAMBO nr. 1 is available at https://github.com/danielriosgarza/MAMBO.
This study strongly depended on recycled data generated by others, as referenced in the appropriate sections above. Moreover, we generated 1,562 GSMMs of human-associated bacteria that can be obtained from https://github.com/danielriosgarza/MAMBO. MAMBO-predicted metabolomic profiles of 37 oral, 50 skin, 39 stool and 49 vaginal metagenomes are listed in Supplementary Table 2, as well as six experimentally measured metabolomic profiles. Correlations between the measured and predicted metabolomes are listed in Supplementary Table 3.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
We thank M. Kooyman (SURFsara) for help implementing MAMBO on the Netherlands Life Science Grid, C.R. Berkers (Utrecht University) for insights regarding the annotation of untargeted metabolome datasets and the CMBI Comics Group for fruitful discussions. D.R.G. is supported by the Science Without Borders program of CNPQ/BRASIL. B.E.D. is supported by Netherlands Organization for Scientific Research (NWO) Vidi grant 864.14.004.