Article | Published:

Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer

Nature Medicine volume 25, pages 679–689 (2019)


Association studies have linked microbiome alterations with many human diseases. However, they have not always reported consistent results, thereby necessitating cross-study comparisons. Here, a meta-analysis of eight geographically and technically diverse fecal shotgun metagenomic studies of colorectal cancer (CRC, n = 768), which was controlled for several confounders, identified a core set of 29 species significantly enriched in CRC metagenomes (false discovery rate (FDR) < 1 × 10⁻⁵). CRC signatures derived from single studies maintained their accuracy in other studies. By training on multiple studies, we improved detection accuracy and disease specificity for CRC. Functional analysis of CRC metagenomes revealed enriched protein and mucin catabolism genes and depleted carbohydrate degradation genes. Moreover, we inferred elevated production of secondary bile acids from CRC metagenomes, suggesting a metabolic link between cancer-associated gut microbes and a fat- and meat-rich diet. Through extensive validations, this meta-analysis firmly establishes globally generalizable, predictive taxonomic and functional microbiome CRC signatures as a basis for future diagnostics.

Data availability

The raw sequencing data for the samples in the German study that have not been published before (see Methods) are available from the European Nucleotide Archive under study no. PRJEB27928. The metadata for these samples are available as Supplementary Table 6.

For the other studies included in the current study, the raw sequencing data can be found under the following European Nucleotide Archive identifiers: PRJEB10878 for Yu et al. (ref. 11); PRJEB12449 for Vogtmann et al. (ref. 10); ERP008729 for Feng et al. (ref. 9); and ERP005534 for Zeller et al. (ref. 8). The independent validation cohorts can be found in the Sequence Read Archive under the identifier no. SRP136711 for Thomas et al. (ref. 27) and in the DNA Data Bank of Japan database under identification no. DRA006684.

The filtered taxonomic and functional profiles used as input for the statistical modeling pipeline are available in Supplementary Data 1.

The code and all analysis results can be found under

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


  1. Tringe, S. G. & Rubin, E. M. Metagenomics: DNA sequencing of environmental samples. Nat. Rev. Genet. 6, 805–814 (2005).

  2. Tremaroli, V. & Bäckhed, F. Functional interactions between the gut microbiota and host metabolism. Nature 489, 242–249 (2012).

  3. Lynch, S. V. & Pedersen, O. The human intestinal microbiome in health and disease. N. Engl. J. Med. 375, 2369–2379 (2016).

  4. Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012).

  5. Karlsson, F. H. et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 498, 99–103 (2013).

  6. Qin, J. et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature 464, 59–65 (2010).

  7. Schirmer, M. et al. Dynamics of metatranscription in the inflammatory bowel disease gut microbiome. Nat. Microbiol. 3, 337–346 (2018).

  8. Zeller, G. et al. Potential of fecal microbiota for early-stage detection of colorectal cancer. Mol. Syst. Biol. 10, 766 (2014).

  9. Feng, Q. et al. Gut microbiome development along the colorectal adenoma-carcinoma sequence. Nat. Commun. 6, 6528 (2015).

  10. Vogtmann, E. et al. Colorectal cancer and the human gut microbiome: reproducibility with whole-genome shotgun sequencing. PLoS ONE 11, e0155362 (2016).

  11. Yu, J. et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut 66, 70–78 (2017).

  12. Bedarf, J. R. et al. Functional implications of microbial and viral gut metagenome changes in early stage L-DOPA-naïve Parkinson’s disease patients. Genome Med. 9, 39 (2017).

  13. Schmidt, T. S. B., Raes, J. & Bork, P. The human gut microbiome: from association to modulation. Cell 172, 1198–1215 (2018).

  14. Forslund, K. et al. Disentangling type 2 diabetes and metformin treatment signatures in the human gut microbiota. Nature 528, 262–266 (2015).

  15. Costea, P. I. et al. Towards standards for human fecal sample processing in metagenomic studies. Nat. Biotechnol. 35, 1069–1076 (2017).

  16. Lozupone, C. A. et al. Meta-analyses of studies of the human microbiota. Genome Res. 23, 1704–1714 (2013).

  17. Duvallet, C., Gibbons, S. M., Gurry, T., Irizarry, R. A. & Alm, E. J. Meta-analysis of gut microbiome studies identifies disease-specific and shared responses. Nat. Commun. 8, 1784 (2017).

  18. Shah, M. S. et al. Leveraging sequence-based faecal microbial community survey data to identify a composite biomarker for colorectal cancer. Gut 67, 882–891 (2018).

  19. Pasolli, E., Truong, D. T., Malik, F., Waldron, L. & Segata, N. Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS Comput. Biol. 12, e1004977 (2016).

  20. Dai, Z. et al. Multi-cohort analysis of colorectal cancer metagenome identified altered bacteria across populations and universal bacterial markers. Microbiome 6, 70 (2018).

  21. Maier, L. et al. Extensive impact of non-antibiotic drugs on human gut bacteria. Nature 555, 623–628 (2018).

  22. Milanese, A. et al. Microbial abundance, activity and population genomic profiling with mOTUs2. Nat. Commun. 10, 1014 (2019).

  23. Kultima, J. R. et al. MOCAT2: a metagenomic assembly, annotation and profiling framework. Bioinformatics 32, 2520–2523 (2016).

  24. Hothorn, T. et al. A Lego system for conditional inference. Am. Stat. 60, 257–263 (2006).

  25. Mandal, S. et al. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb. Ecol. Health Dis. 26, 27663 (2015).

  26. Tjalsma, H., Boleij, A., Marchesi, J. R. & Dutilh, B. E. A bacterial driver-passenger model for colorectal cancer: beyond the usual suspects. Nat. Rev. Microbiol. 10, 575–582 (2012).

  27. Thomas, A. M. et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat. Med. (2019).

  28. Huerta-Cepas, J. et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 44, D286–D293 (2016).

  29. Kanehisa, M. et al. Data, information, knowledge and principle: back to metabolism in KEGG. Nucleic Acids Res. 42, D199–D205 (2014).

  30. Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834–841 (2014).

  31. Vieira-Silva, S. et al. Species–function relationships shape ecological properties of the human gut microbiome. Nat. Microbiol. 1, 16088 (2016).

  32. Hirayama, A. et al. Quantitative metabolome profiling of colon and stomach cancer microenvironment by capillary electrophoresis time-of-flight mass spectrometry. Cancer Res. 69, 4918–4925 (2009).

  33. Denkert, C. et al. Metabolite profiling of human colon carcinoma: deregulation of TCA cycle and amino acid turnover. Mol. Cancer 7, 72 (2008).

  34. Mal, M., Koh, P. K., Cheah, P. Y. & Chan, E. C. Metabotyping of human colorectal cancer using two-dimensional gas chromatography mass spectrometry. Anal. Bioanal. Chem. 403, 483–493 (2012).

  35. Weir, T. L. et al. Stool microbiome and metabolome differences between colorectal cancer patients and healthy adults. PLoS ONE 8, e70803 (2013).

  36. Goedert, J. J. et al. Fecal metabolomics: assay performance and association with colorectal cancer. Carcinogenesis 35, 2089–2096 (2014).

  37. Aykan, N. F. Red meat and colorectal cancer. Oncol. Rev. 9, 288 (2015).

  38. Diet, Nutrition, Physical Activity and Cancer: a Global Perspective. A Summary of the Third Expert Report (World Cancer Research Fund, 2018).

  39. Dutilh, B. E., Backus, L., van Hijum, S. A. & Tjalsma, H. Screening metatranscriptomes for toxin genes as functional drivers of human colorectal cancer. Best Pract. Res. Clin. Gastroenterol. 27, 85–99 (2013).

  40. Sears, C. L. & Garrett, W. S. Microbes, microbiota, and colon cancer. Cell Host Microbe 15, 317–328 (2014).

  41. Ridlon, J. M., Harris, S. C., Bhowmik, S., Kang, D. J. & Hylemon, P. B. Consequences of bile salt biotransformations by intestinal bacteria. Gut Microbes 7, 22–39 (2016).

  42. Yoshimoto, S. et al. Obesity-induced gut microbial metabolite promotes liver cancer through senescence secretome. Nature 499, 97–101 (2013).

  43. Ajouz, H., Mukherji, D. & Shamseddine, A. Secondary bile acids: an underrecognized cause of colon cancer. World J. Surg. Oncol. 12, 164 (2014).

  44. Boleij, A. et al. The Bacteroides fragilis toxin gene is prevalent in the colon mucosa of colorectal cancer patients. Clin. Infect. Dis. 60, 208–215 (2015).

  45. Wu, S. et al. A human colonic commensal promotes colon tumorigenesis via activation of T helper type 17 T cell responses. Nat. Med. 15, 1016–1022 (2009).

  46. Dejea, C. M. et al. Patients with familial adenomatous polyposis harbor colonic biofilms containing tumorigenic bacteria. Science 359, 592–597 (2018).

  47. Ridlon, J. M., Kang, D. J. & Hylemon, P. B. Isolation and characterization of a bile acid inducible 7α-dehydroxylating operon in Clostridium hylemonae TN271. Anaerobe 16, 137–146 (2010).

  48. Mallonee, D. H., White, W. B. & Hylemon, P. B. Cloning and sequencing of a bile acid-inducible operon from Eubacterium sp. strain VPI 12708. J. Bacteriol. 172, 7011–7019 (1990).

  49. Ocvirk, S. & O’Keefe, S. J. D. Influence of bile acids on colorectal cancer risk: potential mechanisms mediated by diet–gut microbiota interactions. Curr. Nutr. Rep. 6, 315–322 (2017).

  50. Gevers, D. et al. The treatment-naive microbiome in new-onset Crohn’s disease. Cell Host Microbe 15, 382–392 (2014).

  51. Viennot, S. et al. Colon cancer in inflammatory bowel disease: recent trends, questions and answers. Gastroenterol. Clin. Biol. 33, S190–S201 (2009).

  52. Rubinstein, M. R. et al. Fusobacterium nucleatum promotes colorectal carcinogenesis by modulating E-cadherin/β-catenin signaling via its FadA adhesin. Cell Host Microbe 14, 195–206 (2013).

  53. Kostic, A. D. et al. Fusobacterium nucleatum potentiates intestinal tumorigenesis and modulates the tumor-immune microenvironment. Cell Host Microbe 14, 207–215 (2013).

  54. Arthur, J. C. et al. Intestinal inflammation targets cancer-inducing activity of the microbiota. Science 338, 120–123 (2012).

  55. Reddy, B. S. Diet and excretion of bile acids. Cancer Res. 41, 3766–3768 (1981).

  56. Ogino, S. et al. Integrative analysis of exogenous, endogenous, tumour and immune factors for precision medicine. Gut 67, 1168–1180 (2018).

  57. Ogino, S., Chan, A. T., Fuchs, C. S. & Giovannucci, E. Molecular pathological epidemiology of colorectal neoplasia: an emerging transdisciplinary and interdisciplinary field. Gut 60, 397–411 (2011).

  58. Hannigan, G. D., Duhaime, M. B., Ruffin, M. T. 4th, Koumpouras, C. C. & Schloss, P. D. Diagnostic potential and interactive dynamics of the colorectal cancer virome. MBio 9, e02248-18 (2018).

  59. zur Hausen, H. Red meat consumption and cancer: reasons to suspect involvement of bovine infectious factors in colorectal cancer. Int. J. Cancer 130, 2475–2483 (2012).

  60. Shkoporov, A. N. et al. Reproducible protocols for metagenomic analysis of human faecal phageomes. Microbiome 6, 68 (2018).

  61. Böhm, J. et al. Discovery of novel plasma proteins as biomarkers for the development of incisional hernias after midline incision in patients with colorectal cancer: The ColoCare study. Surgery 161, 808–817 (2017).

  62. Liesenfeld, D. B. et al. Metabolomics and transcriptomics identify pathway differences between visceral and subcutaneous adipose tissue in colorectal cancer patients: the ColoCare study. Am. J. Clin. Nutr. 102, 433–443 (2015).

  63. Pox, C. P. et al. Efficacy of a nationwide screening colonoscopy program for colorectal cancer. Gastroenterology 142, 1460–1467.e2 (2012).

  64. Furet, J. P. et al. Comparative assessment of human and farm animal faecal microbiota using real-time quantitative PCR. FEMS Microbiol. Ecol. 68, 351–362 (2009).

  65. Mende, D. R. et al. Accurate and universal delineation of prokaryotic species. Nat. Methods 10, 881–884 (2013).

  66. Sunagawa, S. et al. Metagenomic species profiling using universal phylogenetic marker genes. Nat. Methods 10, 1196–1199 (2013).

  67. Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics 25, 1754–1760 (2009).

  68. Tibshirani, R. Regression shrinkage and selection via the LASSO. J. R. Stat. Soc. B 58, 267–288 (1996).

  69. Smialowski, P., Frishman, D. & Kramer, S. Pitfalls of supervised feature selection. Bioinformatics 26, 440–443 (2010).

  70. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B 57, 289–300 (1995).

  71. Robin, X. et al. pROC: an open-source package for R and S+ to analyze and compare ROC curves. BMC Bioinformatics 12, 77 (2011).

  72. Oksanen, J. et al. vegan: Community Ecology Package (The Comprehensive R Archive Network, 2018).

  73. Costea, P. I., Zeller, G., Sunagawa, S. & Bork, P. A fair comparison. Nat. Methods 11, 359 (2014).

  74. Hastie, T., Tibshirani, R. & Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction (Springer, 2009).

  75. Peng, H., Long, F. & Ding, C. Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 27, 1226–1238 (2005).

  76. Edgar, R. C. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32, 1792–1797 (2004).

  77. Altschul, S. F. et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, 3389–3402 (1997).

  78. Eddy, S. R. Accelerated profile HMM searches. PLoS Comput. Biol. 7, e1002195 (2011).

  79. Rice, P., Longden, I. & Bleasby, A. EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet. 16, 276–277 (2000).

  80. Caporaso, J. G. et al. Global patterns of 16S rRNA diversity at a depth of millions of sequences per sample. Proc. Natl Acad. Sci. USA 108, 4516–4522 (2011).



We are thankful to members of the Zeller, Bork, and Arumugam groups for inspiring discussions. Additionally, we thank Y. P. Yuan and the EMBL Information Technology Core Facility for support with high-performance computing, and the EMBL Genomics Core Facility for their sequencing support. We are also grateful for the advice provided by B. Klaus, EMBL Centre for Statistical Data Analysis. We acknowledge funding from EMBL, the German Cancer Research Center, the Huntsman Cancer Foundation, the Intramural Research Program of the National Cancer Institute, ETH Zürich, and the following external sources: the European Research Council (CancerBiome grant no. ERC-2010-AdG_20100317 to P.B., Microbios grant no. ERC-AdG-669830 to P.B., and Meta-PG grant no. ERC-2016-STG-716575 to N.S.); the Novo Nordisk Foundation (grant no. NNF10CC1016515 to M.A.); the Danish Diabetes Academy supported by the Novo Nordisk Foundation and TARGET Research Initiative (Danish Strategic Research Council grant no. 0603-00484B to M.A.); the Matthias-Lackas Foundation (to C.M.U.); the National Cancer Institute (grant nos. R01 CA189184, R01 CA207371, U01 CA206110, and P30 CA042014 to C.M.U.); the Federal Ministry of Education and Research (BMBF; the de.NBI network no. 031A537B to P.B. and the ERA-NET TRANSCAN project no. 01KT1503 to C.M.U.); the Helmut Horten Foundation (to S.Sunagawa); and the Fundação de Amparo à Pesquisa do Estado de São Paulo (grant no. 16/23527-2 to A.M.T.). For the Italy validation cohorts, funding was provided by the Lega Italiana per La Lotta contro i Tumori. For the Japan validation cohort, funding was provided to T.Y. and S.Y. by the National Cancer Center Research and Development Fund (grant nos. 25-A-4, 28-A-4, and 29-A-6); Practical Research Project for Rare/Intractable Diseases from the Japan Agency for Medical Research and Development (grant no. JP18ek0109187); Japan Science and Technology Agency-PRESTO (grant no. JPMJPR1507); Japan Society for the Promotion of Science KAKENHI (grant nos. 16J10135, 142558, and 221S0002); Joint Research Project of the Institute of Medical Science, University of Tokyo; and the Takeda Science and Suzuken Memorial Foundations.

Author information

Author notes

    • Luis Pedro Coelho

    Present address: Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China

  1. These authors contributed equally: Jakob Wirbel, Paul Theodor Pyl.

  2. These authors jointly supervised this work: Manimozhiyan Arumugam, Peer Bork, Georg Zeller.


  1. Structural and Computational Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany

    • Jakob Wirbel, Ece Kartal, Konrad Zych, Alessio Milanese, Jonas S. Fleck, Anita Y. Voigt, Ruby Ponnudurai, Shinichi Sunagawa, Luis Pedro Coelho, Peer Bork & Georg Zeller
  2. Novo Nordisk Foundation Center for Basic Metabolic Research, Faculty of Health and Medicine, University of Copenhagen, Copenhagen, Denmark

    • Paul Theodor Pyl, Alireza Kashani, Albert Palleja & Manimozhiyan Arumugam
  3. Division of Surgery, Oncology and Pathology, Department of Clinical Sciences Lund, Faculty of Medicine, Lund University, Lund, Sweden

    • Paul Theodor Pyl & Emma Niméus
  4. Molecular Medicine Partnership Unit, Heidelberg, Germany

    • Ece Kartal & Peer Bork
  5. The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA

    • Anita Y. Voigt
  6. Department of Biology, ETH Zürich, Zürich, Switzerland

    • Shinichi Sunagawa
  7. Division of Preventive Oncology, National Center for Tumor Diseases and German Cancer Research Center, Heidelberg, Germany

    • Petra Schrotz-King & Hermann Brenner
  8. Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA

    • Emily Vogtmann & Rashmi Sinha
  9. Genome Biology Unit, European Molecular Biology Laboratory, Heidelberg, Germany

    • Nina Habermann
  10. Division of Surgery, Department of Clinical Sciences Lund, Faculty of Medicine, Skane University Hospital, Lund, Sweden

    • Emma Niméus
  11. Department CIBIO, University of Trento, Trento, Italy

    • Andrew M. Thomas, Paolo Manghi & Nicola Segata
  12. Biochemistry Department, Chemistry Institute, University of São Paulo, São Paulo, Brazil

    • Andrew M. Thomas
  13. IEO, European Institute of Oncology IRCCS, Milan, Italy

    • Sara Gandini & Davide Serrano
  14. School of Life Science and Technology, Tokyo Institute of Technology, Tokyo, Japan

    • Sayaka Mizutani, Hirotsugu Shiroma & Takuji Yamada
  15. Research Fellow of Japan Society for the Promotion of Science, Tokyo, Japan

    • Sayaka Mizutani
  16. Division of Cancer Genomics, National Cancer Center Research Institute, Tokyo, Japan

    • Satoshi Shiba, Tatsuhiro Shibata & Shinichi Yachida
  17. Laboratory of Molecular Medicine, Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan

    • Tatsuhiro Shibata
  18. Department of Cancer Genome Informatics, Graduate School of Medicine/Faculty of Medicine, Osaka University, Osaka, Japan

    • Shinichi Yachida
  19. PRESTO, Japan Science and Technology Agency, Saitama, Japan

    • Takuji Yamada
  20. Graduate School of Public Health and Health Policy, City University of New York, New York, NY, USA

    • Levi Waldron
  21. Institute for Implementation Science in Population Health, City University of New York, New York, NY, USA

    • Levi Waldron
  22. Italian Institute for Genomic Medicine, Turin, Italy

    • Alessio Naccarati
  23. Department of Molecular Biology of Cancer, Institute of Experimental Medicine, Prague, Czech Republic

    • Alessio Naccarati
  24. Huntsman Cancer Institute and Department of Population Health Sciences, University of Utah, Salt Lake City, UT, USA

    • Cornelia M. Ulrich
  25. Division of Clinical Epidemiology and Aging Research, German Cancer Research Center, Heidelberg, Germany

    • Hermann Brenner
  26. German Cancer Consortium, German Cancer Research Center, Heidelberg, Germany

    • Hermann Brenner
  27. Faculty of Health Sciences, University of Southern Denmark, Odense, Denmark

    • Manimozhiyan Arumugam
  28. Max Delbrück Centre for Molecular Medicine, Berlin, Germany

    • Peer Bork
  29. Department of Bioinformatics, Biocenter, University of Würzburg, Würzburg, Germany

    • Peer Bork



G.Z., M.A., and P.B. conceived and supervised the study. P.S.K., N.H., C.M.U., H.B., E.V., and R.S. recruited the participants and collected the samples. E.K., A.Y.V., S.Sunagawa, and P.B. generated the metagenomic data. A.M., P.T.P., J.S.F., A.P., S.Sunagawa, L.P.C., G.Z., and M.A. developed the metagenomic profiling workflows and/or performed the taxonomic and functional profiling. J.W., G.Z., K.Z., P.T.P., A.K., M.A., and N.S. performed the statistical analysis and/or developed the statistical analysis workflows. E.K. and R.P. designed and performed the validation experiments. A.M.T., P.M., S.G., D.S., S.M., H.S., S.Shiba, T.S., S.Y., T.Y., L.W., A.N., and N.S. provided additional validation data. J.W., G.Z., M.A., P.T.P., and P.B. designed the figures. G.Z., J.W., M.A., and P.B. wrote the manuscript with contributions from P.T.P., A.M., S.Sunagawa, L.P.C., E.K., A.Y.V., E.V., R.S., P.S.K., H.B., E.N., N.S. and L.W. All authors discussed and approved the manuscript.

Competing interests

P.B., G.Z., A.Y.V., and S.Sunagawa are named inventors on a patent (EP2955232A1: Method for diagnosing adenomas and/or colorectal cancer (CRC) based on analyzing the gut microbiome).

Corresponding authors

Correspondence to Manimozhiyan Arumugam or Peer Bork or Georg Zeller.

Extended data

  1. Extended Data Fig. 1 Potential confounding of individual microbial species associations by patient demographics and technical factors.

    Variance explained by disease status (CRC versus CTRL) is plotted against variance explained by different putative confounding factors for individual microbial species. Each species is represented by a dot proportional in size to its abundance (see legend and Methods); core microbial markers identified in the meta-analysis are highlighted in red. For the confounder analysis, factors with continuous values were discretized into quartiles and the BMI was split into lean/overweight/obese according to conventional cutoffs. The variance explained by disease status was computed for all data; accordingly, the x values are the same in all panels and also in Fig. 1d. The variance explained by different confounding factors was computed using all samples for which data were available (indicated by the insets). Source Data

  2. Extended Data Fig. 2 Study heterogeneity shows a strong influence on alpha and beta diversity.

    a, Alpha diversity as measured with the Shannon index was computed for all gut microbial species (n = 849), reference mOTUs (n = 246), and meta-mOTUs (n = 603) separately. P values were computed using a two-sided Wilcoxon test, while the overall P value (on top) was calculated using a two-sided blocked Wilcoxon test (n = 575 independent observations; see Methods). The ANOVA F-statistic below the panel was calculated using the R function ‘aov’. b, Principal coordinate analysis of samples from all five included studies based on Bray–Curtis distance; the study is color-coded and disease status (CRC versus CTRL) is indicated by filled/unfilled circles. The boxplots on the side and below show samples projected onto the first two principal coordinates broken down by study and disease status, respectively. P values were computed using a two-sided Wilcoxon test for disease status and a Kruskal–Wallis test for study (n = 575 independent observations). For all boxplots, boxes denote the IQR with the median as a thick black line and the whiskers extending up to the most extreme points within 1.5-fold IQR. Country codes are as in Fig. 1b. Source Data

  3. Extended Data Fig. 3 The generalized fold change extends the established (median-based) fold change to provide higher resolution in sparse microbiome data.

    a, In the top row, the logarithmic relative abundances for Bacteroides dorei/vulgatus, Parvimonas micra, and F. nucleatum subspecies animalis—examples for one high-prevalence and two low-prevalence species—are shown as swarm plots for the CTRL and CRC groups. The thick vertical lines indicate the medians in the different groups and the black horizontal line shows the difference between the two medians, which corresponds to the classical (median-based) fold change. Since F. nucleatum subspecies animalis is not detectable in more than 50% of cancer cases, there is no difference between the CTRL and CRC median; thus, the fold change is 0. The lower row shows the same data, but instead of only the median (or 50th percentile), 9 quantiles ranging from 10 to 90% are shown by the thinner vertical lines. The generalized fold change is indicated by the horizontal black line again, computed as the mean of the differences between the corresponding quantiles in both groups. In the case of the sparse data (for example, F. nucleatum), the differences in the 70, 80, and 90% quantiles cause the generalized fold change to be higher than 0. b, The median fold change is plotted against the newly developed generalized fold change for all microbial species. (The core set of microbial CRC marker species is highlighted in orange.) Marginal histograms visualize the distribution for both fold change and generalized fold change. c, Scatter plots showing the relationship between fold change and generalized fold change and the area under the ROC curve (AUROC) or the shift in prevalence between CRC and CTRL, with Spearman’s rank correlations (rho) added in the top left corners; the generalized fold change provides higher resolution (wider distribution around 0) and better correlation with the non-parametric AUROC effect size measure as well as prevalence shift, which captures the difference in prevalence of a species in CRC metagenomes relative to CTRL metagenomes. Source Data

  4. Extended Data Fig. 4 Microbial genera identified in the meta-analysis to be associated with CRC.

    a, The meta-analysis significance of microbial genera, computed using a univariate, two-sided Wilcoxon test blocked for ‘study’ and ‘colonoscopy’ (n = 574 independent observations), is given by bar height (FDR = 0.005). Underneath, significance (FDR-corrected P value computed using a two-sided Wilcoxon test) and generalized fold changes (see Methods) within individual studies are displayed as heatmaps in gray and color, respectively (see keys). Genera are ordered by meta-analysis significance and direction of change. b, For highly significant genera (meta-analysis FDR = 1 × 10⁻⁵), association strength is quantified by the area under the ROC curve across individual studies (color-coded diamonds); 95% confidence intervals are depicted by gray lines. Country codes are as in Fig. 1b. Source Data

  5. Extended Data Fig. 5 The core set of CRC-enriched microbial species can be stratified into four clusters based on co-occurrence in CRC metagenomes.

    a, The heatmap shows the Jaccard index (computed by comparing marker-positive samples; see Methods) for the core set of microbial marker species, computed on CRC cases only. Clustering was performed using the Ward algorithm as implemented in the R function ‘hclust’. The inset shows the distribution of Jaccard similarities within each cluster and for the background (all similarities between species not in the same cluster). b,c, Barplots show the fraction of CRC samples that are positive for a marker species cluster (defined as the union of positive marker species) broken down by patient subgroups based on BMI (b) and age (c) (see Fig. 2b–d for other patient subgroups). The significance of the associations between CRC subgroups and marker species clusters was tested using the Cochran–Mantel–Haenszel test blocked for ‘study’ and ‘colonoscopy’. (No significant associations were detected.) d, For the core set of microbial species with a genomic reference, the presence (red) or absence (white) of superoxide dismutase, peroxidase, and catalase are shown as heatmaps. Presence of the enzyme was determined by checking the protein annotations for the reference projects (see NCBI BioProject ID) in Source Data

  6. Extended Data Fig. 6 Coefficients of LOSO LASSO logistic regression models compared to models trained on individual studies.

    a, The mean coefficients (feature weights) from LASSO cross-validation models trained on single studies (color-coded) are plotted against the single-feature AUROC for each species feature. The horizontal lines highlight the microbial species that are, for at least one study, selected in more than 50% of the models in cross-validation and account for more than 10% of the absolute model weight in at least 10% of the cross-validation models. b, The same plot for the models trained in the LOSO setting (see Methods). The colors indicate which study has been left out of the training set (and is used for validation). The weights of the LOSO models are spread across more species; thus, a lower cutoff is generally applied: species are highlighted by the horizontal lines if their weights explain more than 2.5% of the absolute model weight in at least 10% of cross-validation models and they have been selected in more than 50% of models in cross-validation. c, The inset shows the distribution of the number of non-zero coefficients across all cross-validation models. d, The bar height indicates the number of non-zero coefficients that are shared between the mean models for each study or left-out study, respectively. e, The study-to-study difference (computed as the median of all pairwise differences between model weights for a single species across the mean models) for cross-validation single-study models is plotted against the same measure for the LOSO models. Species with a study-to-study difference of more than 0.02 in the cross-validation models are highlighted and annotated, showing much larger variability between models trained on single studies compared to LOSO models. Country codes as in Fig. 1b. Source Data

  7. Extended Data Fig. 7 Analysis of LOSO models for prediction bias.

    a, To examine whether species- and gene family-level classification models are confounded, that is, biased toward certain patient subgroups, the prediction scores from the LOSO models are broken down into strata for each clinical parameter (for example, female and male for sex). The prediction bias for each variable was tested by Wilcoxon (for sex and BMI) or Kruskal–Wallis (all others) tests while blocking for study as the confounder. The boxes denote the IQRs, with the median as the horizontal black line and the whiskers extending up to the most extreme point within the 1.5-fold IQR. A significant difference in prediction score was detected only for the CRC stage. This stage bias is more pronounced for gene family than for species models. b, To examine the CRC stage bias further, the barplots show the true positive rate corresponding to an overall 10% FPR (see also Fig. 3c) for the different CRC stages, displaying slightly higher classification sensitivity for late-stage CRC for both species and gene family models. Source Data

  8. Extended Data Fig. 8 Cross-study performance of statistical models based on KEGG KO abundances, single-gene abundances from the metagenomic gene catalog (IGC), and the combination of taxonomic and eggNOG database abundance profiles.

    a–c, CRC classification accuracy resulting from cross-validation within each study (gray boxes along the diagonal) and study-to-study model transfer (external validations off the diagonal) as measured by the AUROC for the classification models trained on KEGG KOs (a), models based on the gene catalog (b), and models based on the combination of taxonomic and eggNOG database abundance profiles (c) (see Methods for the details on the statistical modeling workflows). The last column depicts the average AUROC across external validations. The barplots on the right show that the classification accuracy on a hold-out study improves if the data from all other studies are combined for training (LOSO validation) relative to models trained on data from a single study (study-to-study transfer, indicated by the bar color) consistently across different types of input data. Country codes as in Fig. 1b. Source Data

  9. Extended Data Fig. 9 Identification of bai genes in metagenomes.

    Putative bai genes identified in the metagenomic IGC were clustered by co-abundance in metagenomes to infer genomic linkage (see Methods) and, from it, operon completeness and species of origin. a, For each resulting cluster of putative bile acid-converting genes, the mean relative abundance was plotted against the mean percentage of protein identity derived from global alignment against the known bile acid-converting genes from C. scindens and C. hylemonae (see Methods). Completeness, that is, how many of the 11 different bai gene functions are represented in each cluster, and mean gene-to-gene correlation of relative abundance within each cluster are encoded by dot size and color, respectively (see legend). The four clusters with a mean protein identity > 75% to known bai operon-containing genomes were included in the subsequent analysis and labeled with the most highly correlated mOTU (see b). b, Pearson correlation between gene cluster abundances and the relative abundance of the most highly correlated species (in logarithmic space) is given by the bar height for the four gene clusters identified in a. The most highly correlating species is highlighted in darker gray (see labeling of gene clusters in a). c, The log-transformed abundances of all bai genes and the four species identified in b are shown as boxplots for CTRLs (gray) and CRC cases (red). Assessing the significance of the differences between CRC and CTRLs (using a Wilcoxon test blocked for ‘study’ and ‘colonoscopy’) demonstrates a much more significant CRC enrichment of the aggregated metagenomic bai gene abundance than of the individual clostridial species to which these belong. d, ROC curve for the qPCR quantification of the baiF gene in the genomic DNA of a subset of samples in the German study (see Methods and Fig. 4e). Source Data

  10. Extended Data Fig. 10 Validation of the meta-analysis of single-species associations in three independent cohorts.

    a, Heatmap showing for the core set of CRC-associated species (see Fig. 1) the rank of the respective species within the associations of each study, including the three independent validation cohorts (see Table 1), compared to the rank in the meta-analysis (meta) on the left. b, Precision-recall curves for the different independent validation cohorts using the meta-analysis set of associated species at FDR = 0.005 (top) and FDR = 1 × 10−5 (bottom) as the ‘true’ set (see Methods) and the naïve (uncorrected) within-cohort significance as the predictor (see Supplementary Fig. 2). IT1, Italy 1; IT2, Italy 2; JP, Japan; other country codes are as in Fig. 1b. Source Data
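
The precision-recall evaluation described for Extended Data Fig. 10b can be illustrated with a minimal sketch. It is not the authors' code: the function name, the toy species labels, and the p-values below are hypothetical, and ties between p-values are simply broken by sort order. The idea is to rank species by their uncorrected within-cohort p-value and, at each rank, compare the species called so far against the 'true' set taken from the meta-analysis.

```python
def precision_recall(p_values, true_set):
    """Return (recall, precision) points from sweeping a p-value threshold.

    p_values: dict mapping species name -> uncorrected within-cohort p-value
    true_set: set of species treated as the 'true' associations
              (for example, the meta-analysis set at a chosen FDR)
    """
    if not true_set:
        raise ValueError("true_set must not be empty")
    # Rank species from most to least significant (smallest p-value first).
    ranked = sorted(p_values, key=p_values.get)
    points = []
    true_positives = 0
    for n_called, species in enumerate(ranked, start=1):
        if species in true_set:
            true_positives += 1
        precision = true_positives / n_called       # fraction of calls that are true
        recall = true_positives / len(true_set)     # fraction of true set recovered
        points.append((recall, precision))
    return points


# Hypothetical toy data: five species with made-up p-values.
toy_p = {"sp_A": 0.001, "sp_B": 0.02, "sp_C": 0.03, "sp_D": 0.4, "sp_E": 0.9}
toy_true = {"sp_A", "sp_C"}
curve = precision_recall(toy_p, toy_true)
for recall, precision in curve:
    print(f"recall={recall:.2f} precision={precision:.2f}")
```

A curve that stays close to precision 1 until high recall would indicate that the species most significant within a validation cohort largely coincide with the meta-analysis set.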

Supplementary information

  1. Supplementary information

    Supplementary Figures 1–8, Supplementary Tables 1, 2, and 5, and Supplementary References

  2. Reporting Summary

  3. Supplementary Tables

    Supplementary Tables 3, 4, and 6

  4. Supplementary Data

    Supplementary Data 1 and 2

Source data

  1. Source Data Fig. 1

    Statistical Source Data

  2. Source Data Fig. 2

    Statistical Source Data

  3. Source Data Fig. 3

    Statistical Source Data

  4. Source Data Fig. 4

    Statistical Source Data

  5. Source Data Fig. 5

    Statistical Source Data

  6. Source Data Extended Data Fig. 1

    Statistical Source Data

  7. Source Data Extended Data Fig. 2

    Statistical Source Data

  8. Source Data Extended Data Fig. 3

    Statistical Source Data

  9. Source Data Extended Data Fig. 4

    Statistical Source Data

  10. Source Data Extended Data Fig. 5

    Statistical Source Data

  11. Source Data Extended Data Fig. 6

    Statistical Source Data

  12. Source Data Extended Data Fig. 7

    Statistical Source Data

  13. Source Data Extended Data Fig. 8

    Statistical Source Data

  14. Source Data Extended Data Fig. 9

    Statistical Source Data

  15. Source Data Extended Data Fig. 10

    Statistical Source Data

About this article

Further reading