Learning representations of microbe–metabolite interactions


Integrating multiomics datasets is critical for microbiome research; however, inferring interactions across omics datasets poses multiple statistical challenges. We address this problem by using neural networks (https://github.com/biocore/mmvec) to estimate the conditional probability that each molecule is present given the presence of a specific microorganism. Using known environmental (desert soil biocrust wetting) and clinical (cystic fibrosis lung) examples, we show that the method recovers microbe–metabolite relationships, and we demonstrate how it can discover relationships between microbially produced metabolites and inflammatory bowel disease.
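
As a rough illustration of what "the conditional probability that each molecule is present given the presence of a specific microorganism" can look like numerically, the sketch below assumes that the probabilities come from a softmax over a low-rank product of microbe and metabolite embeddings. This parameterization, the function name `conditional_probs` and all variable names are assumptions made for illustration; they are not the mmvec API, and the random matrices stand in for parameters that would be learned from data.

```python
import numpy as np


def conditional_probs(microbe_emb, metabolite_emb, metabolite_bias):
    """Return P where P[i, j] approximates Pr(metabolite j | microbe i).

    Hypothetical parameterization: a row-wise softmax over
    microbe_emb @ metabolite_emb + metabolite_bias.
    """
    logits = microbe_emb @ metabolite_emb + metabolite_bias[None, :]
    # Subtract the row maximum before exponentiating for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    expd = np.exp(logits)
    return expd / expd.sum(axis=1, keepdims=True)


# Random placeholders standing in for learned parameters (illustration only).
rng = np.random.default_rng(0)
n_microbes, n_metabolites, latent_dim = 4, 6, 2
microbe_emb = rng.normal(size=(n_microbes, latent_dim))
metabolite_emb = rng.normal(size=(latent_dim, n_metabolites))
metabolite_bias = rng.normal(size=n_metabolites)

P = conditional_probs(microbe_emb, metabolite_emb, metabolite_bias)
print(P.shape)        # one row per microbe, one column per metabolite
print(P.sum(axis=1))  # each row is a probability distribution
```

Each row of `P` sums to one, so row i can be read as a distribution over metabolites conditioned on microbe i.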

Fig. 1: Input data types and mmvec neural network architecture.
Fig. 2: Simulation benchmarks.
Fig. 3: Metabolites released by M. vaginatus after the biocrust wetting event.
Fig. 4: Investigation of P. aeruginosa-associated molecules.
Fig. 5: Microbe–metabolite co-occurrences across study of hepatocarcinoma progression in the context of innate immunity in a mouse model.
Fig. 6: Microbe–metabolite interactions of the human microbiome in association with IBD samples.

Data availability

The cystic fibrosis sequencing data and metadata can be found at https://qiita.microbio.me/ under study ID 10863. The corresponding GNPS analysis can be accessed at https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=34d825dbf4e9466e81d809faf814995b. The biocrust soil data were retrieved from the supplemental section in Swenson et al. (ref. 30). The HFD murine model case study 16S rRNA data can be found at https://qiita.microbio.me/ under study ID 10856. The HFD murine model case study data are publicly available at https://massive.ucsd.edu/ under MassIVE ID MSV000080918. The GNPS analysis for this study can be accessed at https://gnps.ucsd.edu/ProteoSAFe/status.jsp?task=977d85bba47b4e96bf69872b961b8edd. The IBD data used can be found at https://ibdmdb.org/.

Code availability

The software implementing the mmvec algorithm can be found at https://github.com/biocore/mmvec. Differential abundance analyses in the HFD study were performed with L2-regularized multinomial regression, using software available at https://github.com/biocore/songbird. The software used to build the multiomics network can be found at https://github.com/mortonjt/multiomics_network. Biplots were generated using Emperor (ref. 47).


References

1. Jansson, J. K. & Baker, E. S. A multi-omic future for microbiome studies. Nat. Microbiol. 1, 645 (2016).
2. Noecker, C. et al. Metabolic model-based integration of microbiome taxonomic and metabolomic profiles elucidates mechanistic links between ecological and metabolic variation. MSystems 1, e00013–e00015 (2016).
3. Mallick, H. et al. Predictive metabolomic profiling of microbial communities using amplicon or metagenomic sequences. Nat. Commun. 10, 3136 (2019).
4. Knight, R. et al. Best practices for analysing microbiomes. Nat. Rev. Microbiol. 16, 410–422 (2018).
5. Meng, C. et al. Dimension reduction techniques for the integrative analysis of multi-omics data. Brief. Bioinform. 17, 628–641 (2016).
6. Le Gall, G. et al. Metabolomics of fecal extracts detects altered metabolic activity of gut microbiota in ulcerative colitis and irritable bowel syndrome. J. Proteome Res. 10, 4208–4218 (2011).
7. Rohart, F. et al. mixOmics: an R package for ‘omics feature selection and multiple data integration. PLoS Comput. Biol. 13, e1005752 (2017).
8. Wang, B. et al. Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11, 333 (2014).
9. Argelaguet, R. et al. Multi-omics factor analysis—a framework for unsupervised integration of multi-omics data sets. Mol. Syst. Biol. 14, e8124 (2018).
10. Ter Braak, C. J. F. & Verdonschot, P. F. M. Canonical correspondence analysis and related multivariate methods in aquatic ecology. Aquat. Sci. 57, 255–289 (1995).
11. Witten, D. M., Tibshirani, R. & Hastie, T. A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10, 515–534 (2009).
12. Bodein, A., Chapleur, O., Droit, A. & Lê Cao, K. A. A generic multivariate framework for the integration of microbiome longitudinal studies with other data types. Preprint at bioRxiv https://doi.org/10.1101/585802 (2019).
13. Friedman, J. & Alm, E. J. Inferring correlation networks from genomic survey data. PLoS Comput. Biol. 8, e1002687 (2012).
14. Kurtz, Z. D. et al. Sparse and compositionally robust inference of microbial ecological networks. PLoS Comput. Biol. 11, e1004226 (2015).
15. Weiss, S. et al. Correlation detection strategies in microbial data sets vary widely in sensitivity and precision. ISME J. 10, 1669–1681 (2016).
16. Vandeputte, D. et al. Quantitative microbiome profiling links gut community variation to microbial load. Nature 551, 507–511 (2017).
17. Gloor, G. B. et al. Microbiome datasets are compositional: and this is not optional. Front. Microbiol. 8, 2224 (2017).
18. Tang, K., Page, J. S. & Smith, R. D. Charge competition and the linear dynamic range of detection in electrospray ionization mass spectrometry. J. Am. Soc. Mass Spectrom. 15, 1416–1423 (2004).
19. King, R., Bonfiglio, R., Fernandez-Metzler, C., Miller-Stein, C. & Olah, T. Mechanistic investigation of ionization suppression in electrospray ionization. J. Am. Soc. Mass Spectrom. 11, 942–950 (2000).
20. Matuszewski, B. K., Constanzer, M. L. & Chavez-Eng, C. M. Strategies for the assessment of matrix effect in quantitative bioanalytical methods based on HPLC–MS/MS. Anal. Chem. 75, 3019–3030 (2003).
21. Kalivodová, A. et al. PLS-DA for compositional data with application to metabolomics. J. Chemom. 29, 21–28 (2015).
22. Jansson, J. K. & Baker, E. S. A multi-omic future for microbiome studies. Nat. Microbiol. 1, 16049 (2016).
23. Lovell, D., Pawlowsky-Glahn, V., Egozcue, J. J., Marguerat, S. & Bähler, J. Proportionality: a valid alternative to correlation for relative data. PLoS Comput. Biol. 11, e1004075 (2015).
24. Tipton, L. et al. Fungi stabilize connectivity in the lung and skin microbial ecosystems. Microbiome 6, 12 (2018).
25. Morton, J. T. et al. Establishing microbial composition measurement standards with reference frames. Nat. Commun. 10, 2719 (2019).
26. Abadi, M. et al. TensorFlow: a system for large-scale machine learning. In Proc. 12th Symposium on Operating Systems Design and Implementation 265–283 (USENIX Association, 2016).
27. Quinn, R. A. et al. Niche partitioning of a pathogenic microbiome driven by chemical gradients. Sci. Adv. 4, eaau1908 (2018).
28. Shalapour, S. et al. Inflammation-induced IgA+ cells dismantle anti-liver cancer immunity. Nature 551, 340–345 (2017).
29. Lloyd-Price, J. et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655–662 (2019).
30. Swenson, T. L., Karaoz, U., Swenson, J. M., Bowen, B. P. & Northen, T. R. Linking soil biology and chemistry in biological soil crust using isolate exometabolomics. Nat. Commun. 9, 19 (2018).
31. Baran, R. et al. Exometabolite niche partitioning among sympatric soil bacteria. Nat. Commun. 6, 8289 (2015).
32. Quinn, R. A. et al. A Winogradsky-based culture system shows an association between microbial fermentation and cystic fibrosis exacerbation. ISME J. 9, 1024–1038 (2015).
33. Moree, W. J. et al. Interkingdom metabolic transformations captured by microbial imaging mass spectrometry. Proc. Natl Acad. Sci. USA 109, 13811–13816 (2012).
34. Wang, M. et al. Sharing and community curation of mass spectrometry data with global natural products social molecular networking. Nat. Biotechnol. 34, 828–837 (2016).
35. Maier, R. M. & Soberon-Chavez, G. Pseudomonas aeruginosa rhamnolipids: biosynthesis and potential applications. Appl. Microbiol. Biotechnol. 54, 625–633 (2000).
36. Wood, T. L. et al. Rhamnolipids from Pseudomonas aeruginosa disperse the biofilms of sulfate-reducing bacteria. NPJ Biofilms Microbiomes 4, 22 (2018).
37. Allen, L. et al. Pyocyanin production by Pseudomonas aeruginosa induces neutrophil apoptosis and impairs neutrophil-mediated host defenses in vivo. J. Immunol. 174, 3643–3649 (2005).
38. Quinn, R. A. et al. Chemical impacts of the microbiome across scales reveal novel conjugated bile acids. Preprint at bioRxiv https://doi.org/10.1101/654756 (2019).
39. Paczosa, M. K. & Mecsas, J. Klebsiella pneumoniae: going on the offense with a strong defense. Microbiol. Mol. Biol. Rev. 80, 629–661 (2016).
40. Tiratterra, E. et al. Role of bile acids in inflammatory bowel disease. Ann. Gastroenterol. 31, 266 (2018).
41. Hofmann, A. F. & Eckmann, L. How bile acids confer gut mucosal protection against bacteria. Proc. Natl Acad. Sci. USA 103, 4333–4334 (2006).
42. Begley, M., Gahan, C. G. M. & Hill, C. The interaction between bacteria and bile. FEMS Microbiol. Rev. 29, 625–651 (2005).
43. Okada, Y. et al. Propionibacterium freudenreichii component 1,4-dihydroxy-2-naphthoic acid (DHNA) attenuates dextran sodium sulphate induced colitis by modulation of bacterial flora and lymphocyte homing. Gut 55, 681–688 (2006).
44. Foligne, B. et al. Immunomodulation properties of multi-species fermented milks. Food Microbiol. 53, 60–69 (2016).
45. Ple, C. et al. Combining selected immunomodulatory Propionibacterium freudenreichii and Lactobacillus delbrueckii strains: reverse engineering development of an anti-inflammatory cheese. Mol. Nutr. Food Res. 60, 935–948 (2016).
46. Colliou, N. et al. Commensal Propionibacterium strain UF1 mitigates intestinal inflammation via Th17 cell regulation. J. Clin. Invest. 127, 3970–3986 (2017).
47. Vázquez-Baeza, Y., Pirrung, M., Gonzalez, A. & Knight, R. Emperor: a tool for visualizing high-throughput microbial community data. Gigascience 2, 16 (2013).
48. Nasrabadi, N. M. Pattern recognition and machine learning. J. Electron. Imaging 16, 049901 (2007).
49. Pawlowsky-Glahn, V., Egozcue, J. J. & Tolosana-Delgado, R. Modeling and Analysis of Compositional Data (John Wiley & Sons, 2015).
50. Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S. & Dean, J. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26 (eds Burges, C. J. C. et al.) 3111–3119 (NIPS, 2013).
51. Koren, Y., Bell, R. & Volinsky, C. Matrix factorization techniques for recommender systems. Computer 42, 30–37 (2009).
52. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at arXiv https://arxiv.org/abs/1412.6980 (2014).
53. Blei, D. M., Ng, A. Y. & Jordan, M. I. Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003).
54. Sankaran, K. & Holmes, S. P. Latent variable modeling for the microbiome. Biostatistics 20, 599–614 (2019).
55. Aitchison, J. & Greenacre, M. Biplots of compositional data. J. R. Stat. Soc. Ser. C Appl. Stat. 51, 375–392 (2002).
56. Aitchison, J. & Ng, K. W. Conditional compositional biplots: theory and application. DUGiDocs https://dugi-doc.udg.edu/handle/10256/657 (2005).
57. Martín-Fernández, J. A., Pawlowsky-Glahn, V., Egozcue, J. J. & Tolosana-Delgado, R. Advances in principal balances for compositional data. Math. Geosci. 50, 273–298 (2018).
58. Bolyen, E. et al. Reproducible, interactive, scalable, and extensible microbiome data science using QIIME 2. Nat. Biotechnol. 37, 852–857 (2019).

Acknowledgements

We would like to thank V. Pawlowsky, J. J. Egozcue and S. Holmes for their insights on the geometry of this neural network model. We also thank N. Bokulich for feedback and contributions on the mmvec software package. T.L.S., M.W.V.G. and T.R.N. acknowledge funding from the Office of Science Early Career Research Program, Office of Biological and Environmental Research of the U.S. Department of Energy under contract number DE-AC02-05CH11231 to Lawrence Berkeley National Laboratory. This study was in part supported by grant P41GM103484 for the Center for Computational Mass Spectrometry and instrument support through National Institutes of Health grants S10RR029121 and R03 CA211211 on reuse of metabolomics data. Y.V.-B. is funded by the Janssen Human Microbiome Institute through a collaboration with the Center for Microbiome Innovation. J.T.M. was funded by National Science Foundation grant GRFP DGE-1144086. R.K. and S.J.S. have been funded by Janssen under grant number 20175015 and the Alfred P. Sloan Foundation under grant number G-2017-9838.

Author information

J.T.M. wrote the mmvec algorithm, conducted the benchmarks and ran all of the analyses. A.A.A. and L.F.N. preprocessed and annotated the metabolomics data. A.A.A. provided insights into the HFD study. J.R.F. provided insights behind word2vec and topic modeling. M.H.B. benchmarked SPIEC-EASI. R.A.Q. provided insights behind the cystic fibrosis study and simulations. Y.V.-B. provided insights behind the interpretation of the IBD analysis. M.W. developed the GNPS workflow for mmvec. N.A.B. developed the heat map visualizations. A.W. developed the network visualizations. T.L.S., M.W.V.G. and T.R.N. provided insights into the biocrust soils experiment. R.B. provided insights behind the simulation benchmarks. S.J.S. provided ecological insights. P.C.D. provided insights behind metabolomics. All authors were involved in writing the manuscript.

Correspondence to Rob Knight.

Ethics declarations

Competing interests

Mingxun Wang is the founder of Ometa Labs LLC. The remaining authors declare no competing interests.

Additional information

Peer review information Lei Tang was the primary editor on this article and managed its editorial process and peer review in collaboration with the rest of the editorial team.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Integrated supplementary information

Supplementary Figure 1 Description of the compositionality issue.

(a) An illustration of how false negatives can occur: in the absolute abundance data, there is a strong Pearson correlation between the microbes and the metabolites across samples (n = 50). These correlations disappear when considering the corresponding proportions. (b) An illustration of how false positives can occur: in the absolute abundance data, there is no correlation between the dark green molecule and the dark blue microbe (n = 50 samples). However, the proportions of the same dataset show a very strong correlation between the dark blue microbe and the dark green molecule.
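
Both effects can be reproduced with a small synthetic example. The sketch below is not the simulation behind this figure: it is a deliberately contrived, deterministic construction with two microbes and two metabolites over n = 50 samples, in which sinusoids with different frequencies stand in for independent fluctuations. All names (`closure`, `pearson`, `wave`, and the abundance arrays) are hypothetical.

```python
import numpy as np


def closure(table):
    """Convert absolute abundances (samples x features) to per-sample proportions."""
    return table / table.sum(axis=1, keepdims=True)


def pearson(x, y):
    """Pearson correlation coefficient between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


n = 50
theta = 2 * np.pi * np.arange(n) / n


def wave(k):
    """Sinusoid with k full periods over the n samples."""
    return np.sin(k * theta)


# (a) False negative: the focal microbe and metabolite are perfectly
# correlated in absolute terms, but each shares its table with a second
# feature whose abundance fluctuates independently.
microbe = 10 + 5 * wave(3)
metabolite = 2 * microbe
other_microbe = microbe * (2 + wave(7))
other_metabolite = metabolite * (2 + wave(11))

fn_abs = pearson(microbe, metabolite)
fn_rel = pearson(
    closure(np.column_stack([microbe, other_microbe]))[:, 0],
    closure(np.column_stack([metabolite, other_metabolite]))[:, 0],
)

# (b) False positive: the focal microbe and metabolite are uncorrelated in
# absolute terms, but a second microbe and a second metabolite both grow
# steadily, so both focal proportions shrink together.
microbe = 10 + wave(7)
metabolite = 10 + wave(11)
growth = np.linspace(1, 1000, n)

fp_abs = pearson(microbe, metabolite)
fp_rel = pearson(
    closure(np.column_stack([microbe, growth]))[:, 0],
    closure(np.column_stack([metabolite, growth]))[:, 0],
)

print(f"false negative: absolute r = {fn_abs:.3f}, proportional r = {fn_rel:.3f}")
print(f"false positive: absolute r = {fp_abs:.3f}, proportional r = {fp_rel:.3f}")
```

In case (a) the correlation is close to one on absolute abundances and close to zero on proportions; in case (b) the pattern is reversed.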

Supplementary Figure 2 Illustration of how excessive misannotation rates can occur.

(a) Absolute abundances and relative abundances of microbes/metabolites observed in an environment over time, with each microbe/metabolite colored according to its rate of increase or decrease (n = 30). (b) A scale-invariance comparison of statistical methodologies. Points are colored by the corresponding microbes in the interactions; triangle markers represent increasing metabolites and decreasing metabolites. Mmvec is the only method that remains consistent between the absolute and relative abundances.
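
For readers unfamiliar with scale invariance, the sketch below shows the property on a standard compositional transform, the centered log-ratio (clr): rescaling every sample to proportions leaves the transformed values unchanged. This is a generic illustration of the concept and is not a claim about how mmvec or the other benchmarked methods are implemented; the function name `clr` and the toy abundance table are assumptions.

```python
import numpy as np


def clr(table):
    """Centered log-ratio transform applied to each row (sample)."""
    log_table = np.log(table)
    return log_table - log_table.mean(axis=1, keepdims=True)


# Toy absolute abundances: 2 samples x 3 features (made-up values).
absolute = np.array([[10.0, 200.0, 40.0],
                     [5.0, 50.0, 500.0]])
relative = absolute / absolute.sum(axis=1, keepdims=True)

# The raw tables differ, but their log-ratio representations coincide.
same_raw = np.allclose(absolute, relative)
same_clr = np.allclose(clr(absolute), clr(relative))
print(same_raw, same_clr)
```

A method whose output depends only on such log-ratios gives the same answer on absolute and relative abundances, which is the consistency that panel (b) evaluates.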

Supplementary Figure 3 Comparison of Pearson and mmvec on the cystic fibrosis study.

(a) Estimates of P. aeruginosa-associated molecules from Pearson correlation and from the conditional probabilities calculated by mmvec, applied to the cystic fibrosis study dataset. The annotations correspond to level 2 or 3 of the Metabolomics Standards Initiative and may correspond to different isomeric species (n = 462 molecules). (b) Ranks of Pearson coefficients and of mmvec conditional probabilities for the rhamnolipids (n = 462 molecules). (c) Pyochelin proportions versus P. aeruginosa proportions.

Supplementary Figure 4 Negative log likelihood and prediction accuracy of mmvec.

TensorBoard visualization of training error and cross-validation error of mmvec on the IBD dataset. Five different runs with differing initialization conditions are shown.

Supplementary Figure 5 GNPS job output.

An example of a job on the GNPS website, with the job description and the downloadable output files from mmvec.

Supplementary information

Supplementary Information

Supplementary Figs. 1–5 and Supplementary Note

Reporting Summary

About this article

Cite this article

Morton, J.T., Aksenov, A.A., Nothias, L.F. et al. Learning representations of microbe–metabolite interactions. Nat Methods (2019) doi:10.1038/s41592-019-0616-3
