  • Review Article

Machine learning for microbiologists

Abstract

Machine learning is increasingly important in microbiology, where it is used for tasks such as predicting antibiotic resistance and associating human microbiome features with complex host diseases. The applications in microbiology are quickly expanding, and the machine learning tools frequently used in basic and clinical research range from classification and regression to clustering and dimensionality reduction. In this Review, we examine the main machine learning concepts, tasks and applications that are relevant for experimental and clinical microbiologists. We provide the minimal toolbox a microbiologist needs to understand, interpret and use machine learning in their experimental and translational activities.

Fig. 1: General workflow and examples for machine learning applications in microbiology.
Fig. 2: Practical examples of unsupervised learning tasks.
Fig. 3: Training and testing strategies for supervised machine learning model evaluation.
Fig. 4: Supervised machine learning evaluation methods in a real-data example.

Acknowledgements

The authors thank all members of the Computational Metagenomics laboratory, the Waldron laboratory and the Structured Machine Learning Group for their feedback and suggestions. This work was supported by the European Research Council (ERC-STG project MetaPG-716575 and ERC-CoG microTOUCH-101045015) to N.S., by the European Union’s Horizon 2020 programme (IHMCSA-964590) to N.S. and F.A., and by the European Union under NextGenerationEU (Interconnected Nord-Est Innovation programme (INEST)) to N.S. Views and opinions expressed are those of the author(s) only and do not necessarily reflect those of the European Union or The European Research Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

Author information

Contributions

N.S., F.A. and A.M.T. contributed equally to all aspects of the article. A.P. contributed substantially to discussion of the content and reviewed and/or edited the manuscript before submission. L.W. contributed substantially to discussion of the content, writing, and review and/or editing of the manuscript before submission.

Corresponding authors

Correspondence to Levi Waldron or Nicola Segata.

Ethics declarations

Competing interests

The authors declare no competing interests.

Peer review

Peer review information

Nature Reviews Microbiology thanks Elhanan Borenstein and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Glossary

Accuracy

The number of correct classification predictions (true positives + true negatives) divided by the total number of predictions (true positives + true negatives + false positives + false negatives).
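
As a worked illustration of this definition, the following Python sketch computes accuracy from the four confusion-matrix counts. The function name and the example counts are hypothetical and chosen only for illustration.

```python
def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of correct predictions among all predictions."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("at least one prediction is required")
    return (tp + tn) / total


# Hypothetical counts: 40 true positives, 45 true negatives,
# 5 false positives and 10 false negatives (85 correct out of 100).
print(accuracy(tp=40, tn=45, fp=5, fn=10))
```

Because accuracy weights all predictions equally, it can look high on strongly imbalanced data even when the minority class is rarely predicted correctly.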

Area under the ROC curve

(AUC-ROC). A number between 0 and 1 that is obtained by integrating the receiver operating characteristic (ROC) curve over the different classification thresholds and that represents the ability of a binary classification model to discriminate between two classes, where 0.5 and 1 represent the random and perfect classification of the samples, respectively.
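
One way to make this number concrete uses the fact that the AUC-ROC equals the fraction of (positive, negative) pairs of examples in which the positive example receives the higher score, with ties counted as half. The sketch below computes it this way rather than by integrating the curve explicitly; the function name, labels and scores are hypothetical.

```python
def auc_roc(labels, scores):
    """AUC-ROC computed by comparing every positive with every negative example."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = case, 0 = control) and model scores.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.2]
print(auc_roc(labels, scores))  # 8 of the 9 pairs are ranked correctly
```

The pairwise loop is quadratic in the number of examples, which is acceptable for illustration but not for large data sets.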

Cross-validation

An approach to provide robust performance estimates of how well the trained model generalizes on new data by splitting a data set into multiple subsets and iteratively training on some subsets and testing on the others.
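
The splitting step can be sketched as follows for k-fold cross-validation, one common variant. This is a simplified illustration: the function name is hypothetical, and the shuffling and class stratification that are usually applied in practice are omitted.

```python
def k_fold_indices(n_examples: int, k: int):
    """Yield (train_indices, test_indices) for each of k folds."""
    if not 2 <= k <= n_examples:
        raise ValueError("k must be between 2 and the number of examples")
    indices = list(range(n_examples))
    # Spread examples as evenly as possible: the first (n % k) folds get one extra.
    fold_sizes = [n_examples // k + (1 if i < n_examples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# Each example appears in exactly one test fold and is never in
# the training and test parts of the same fold.
for train, test in k_fold_indices(n_examples=10, k=3):
    print("train:", train, "test:", test)
```

A model would be trained and evaluated once per fold, and the k performance estimates would then be summarized, for example by their mean.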

Data set

A set of examples with input features and target values (if available), used to train and/or evaluate machine learning models, that can be divided into three non-overlapping subsets: training, validation and test sets. It is crucial to ensure that the same example is not present in both training and test (or validation) sets for a correct estimate of the generalizability of the learned model.

Decision tree

A non-parametric supervised learning method with a hierarchical tree structure to represent a set of if–then–else rules for different conditions. The internal nodes define conditions, and the leaves represent outputs.
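
Written out as code, a small decision tree is simply a set of nested conditions. The feature names, thresholds and class labels below are hypothetical and do not come from any real model; they only illustrate the structure of internal nodes and leaves.

```python
def classify(example: dict) -> str:
    """A hand-written two-level decision tree (illustrative only)."""
    # Internal node: condition on a hypothetical relative abundance.
    if example["abundance_taxon_a"] > 0.10:
        # Second internal node, reached only when the first condition holds.
        if example["abundance_taxon_b"] > 0.05:
            return "case"      # leaf
        return "control"       # leaf
    return "control"           # leaf


print(classify({"abundance_taxon_a": 0.20, "abundance_taxon_b": 0.08}))
print(classify({"abundance_taxon_a": 0.02, "abundance_taxon_b": 0.08}))
```

In a learned tree, the conditions and thresholds are chosen automatically from the training set rather than written by hand.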

Example

A processed version of the microbiological sample, including features and, possibly, targets.

Features

The microbiological data information extracted from the samples that are provided as input to the machine learning model.

Least absolute shrinkage and selection operator

(LASSO). A linear model approach that performs both variable selection and regularization (stabilization of regression coefficients) and tends to give solutions with few non-zero coefficients, to reduce the number of features and enhance the interpretability of the model.
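
For a linear regression, the LASSO estimate can be written as the solution of the following optimization problem. The notation is an assumption made here for illustration: X is the n × p feature matrix, y the vector of n target values, β the vector of p coefficients, and λ ≥ 0 the regularization strength. The 1/(2n) scaling is one common convention.

```latex
\hat{\beta} = \arg\min_{\beta} \; \frac{1}{2n} \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_1
```

Larger values of λ shrink more coefficients to exactly zero, which is how the method selects features.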

Leave-one-data set-out

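(LODO). An approach used to estimate model generalizability across data sets when multiple different data sets are available: the model is trained on all data sets except one and tested on the data set left out, and the procedure is repeated so that each data set is held out once.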
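
The splitting logic of LODO can be sketched as follows, holding out each data set in turn and training on the others. The function name and the cohort names are hypothetical, and the examples are reduced to placeholder strings.

```python
def leave_one_dataset_out(datasets: dict):
    """Yield (held_out_name, training_examples, test_examples) per data set."""
    if len(datasets) < 2:
        raise ValueError("LODO needs at least two data sets")
    for held_out in datasets:
        # Pool the examples of every data set except the held-out one.
        train = [
            example
            for name, examples in datasets.items()
            if name != held_out
            for example in examples
        ]
        test = list(datasets[held_out])
        yield held_out, train, test


# Hypothetical cohorts with placeholder examples.
cohorts = {
    "cohort_A": ["a1", "a2"],
    "cohort_B": ["b1"],
    "cohort_C": ["c1", "c2", "c3"],
}
for name, train, test in leave_one_dataset_out(cohorts):
    print(name, "train size:", len(train), "test size:", len(test))
```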

Model

A mathematical object with appropriately set parameters used to make predictions.

Naive Bayes

A supervised learning algorithm based on the application of Bayes’ theorem with the ‘naive’ assumption that all features are independent of each other given the class.

Neural network

A model with at least one hidden layer, a set of unobserved variables called ‘neurons’ derived from input features. Deep neural networks contain at least two hidden layers, where each neuron in a hidden layer connects to all the neurons of the next hidden layer. Combining many hidden layers and their interconnections enables modelling complex and non-linear relationships between input features and target values.

Precision

A metric for classification models that measures the fraction of true positive examples over the set of examples predicted as positives (true positives / (true positives + false positives)).

Random forest

An ensemble method that relies on a collection of independently trained decision tree models whose predictions are then aggregated to make one single prediction.
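
For classification, a common way to aggregate the trees is a majority vote. The sketch below shows only this aggregation step, assuming the individual tree predictions are already available; the function name and the predictions are hypothetical.

```python
from collections import Counter


def majority_vote(tree_predictions):
    """Return the class predicted by the largest number of trees."""
    if not tree_predictions:
        raise ValueError("at least one prediction is required")
    counts = Counter(tree_predictions)
    # most_common(1) returns the class with the highest count; with tied
    # counts, the class that was seen first is returned.
    return counts.most_common(1)[0][0]


# Hypothetical predictions from five trees for a single example.
print(majority_vote(["case", "control", "case", "case", "control"]))
```

For regression, the tree outputs would typically be averaged instead.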

Recall

A metric for classification models that measures the fraction of true positive examples over the set of positive examples, also known as coverage (true positives / (true positives + false negatives)).
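
Precision and recall can be computed together from the true and predicted labels, as in the minimal sketch below. The function name and labels are hypothetical, and returning 0.0 when a denominator is zero is an assumption made here for convenience.

```python
def precision_recall(y_true, y_pred):
    """Return (precision, recall) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    # Assumption: report 0.0 when a metric is undefined (zero denominator).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels: 3 true positives, 1 false positive, 3 false negatives.
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
print(precision_recall(y_true, y_pred))  # precision 3/4, recall 3/6
```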

Receiver operating characteristic (ROC) curve

A plot of the true positive rate against the false positive rate at different classification thresholds, used to evaluate a binary classification model. The shape of the curve reflects the ability of the model to separate the two classes.

Samples

Original items, for example microbiological entities, from which feature data and target values are derived.

Supervised machine learning

An algorithm that trains a model to predict the target based on input features, resulting in a trained model capable of classifying new and unseen samples using the same set of features.

Support vector machines

(SVMs). A set of supervised learning prediction methods based on statistical learning theory that aim to maximize the margin, that is, the distance between the decision boundary and the closest examples of the positive and negative classes.

Target value

A priori defined classes or quantities of microbiological interest (for example, case or control labels, Gram positive or negative staining, optimal pH values for bacterial growth) associated with examples, that are available only at training time and need to be predicted at test time from the features alone.

Test set

The (sub)set of a data set used for the final evaluation of the trained model or for which the outcomes of interest are not known and should be predicted by the trained model.

Training set

The (sub)set of a data set that is used for training a machine learning model.

Unsupervised machine learning

An algorithm that trains a model based solely on input features to derive patterns without further knowledge about the samples from which features were extracted.

Validation set

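The (sub)set of a data set used to evaluate a trained model during model development, for example to tune hyperparameters or to choose between candidate models, before the final evaluation on the test set.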

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Asnicar, F., Thomas, A.M., Passerini, A. et al. Machine learning for microbiologists. Nat Rev Microbiol 22, 191–205 (2024). https://doi.org/10.1038/s41579-023-00984-1


  • DOI: https://doi.org/10.1038/s41579-023-00984-1
