Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation

An Author Correction to this article was published on 29 October 2019

This article has been updated

Abstract

Several studies have investigated links between the gut microbiome and colorectal cancer (CRC), but questions remain about the replicability of biomarkers across cohorts and populations. We performed a meta-analysis of five publicly available datasets and two new cohorts and validated the findings on two additional cohorts, considering in total 969 fecal metagenomes. Unlike microbiome shifts associated with gastrointestinal syndromes, the gut microbiome in CRC showed reproducibly higher richness than controls (P < 0.01), partially due to expansions of species typically derived from the oral cavity. Meta-analysis of the microbiome functional potential identified gluconeogenesis and the putrefaction and fermentation pathways as being associated with CRC, whereas the stachyose and starch degradation pathways were associated with controls. Predictive microbiome signatures for CRC trained on multiple datasets showed consistently high accuracy in datasets not considered for model training and independent validation cohorts (average area under the curve, 0.84). Pooled analysis of raw metagenomes showed that the choline trimethylamine-lyase gene was overabundant in CRC (P = 0.001), identifying a relationship between microbiome choline metabolism and CRC. The combined analysis of heterogeneous CRC cohorts thus identified reproducible microbiome biomarkers and accurate disease-predictive models that can form the basis for clinical prognostic tests and hypothesis-driven mechanistic studies.

Access options

Rent or Buy article

Get time limited or full article access on ReadCube.

from$8.99

All prices are NET prices.

Fig. 1: Reproducible taxonomic and functional microbial biomarkers across datasets when comparing carcinoma to healthy controls (no adenoma samples considered).
Fig. 2: Assessment of prediction performances of the gut microbiome for CRC detection within and across cohorts.
Fig. 3: Ranking relevance of each species in the predictive models for each dataset and identification of a minimal microbial signature for CRC detection.
Fig. 4: Choline TMA-lyase gene cutC and its genetic variants are strong biomarkers for CRC-associated stool samples.
Fig. 5: Clinical potential and validation of the predictive biomarkers.

Data availability

Nucleotide sequences for the two new Italian cohorts are available in the Sequence Read Archive under accession No. SRP136711. MetaPhlAn2 and HUMANn2 profiles for the new cohorts were also added to the curatedMetagenomicData R package27 along with their corresponding metadata. Validation Cohort1 is available in the European Nucleotide Archive under the study identifier PRJEB27928; Validation Cohort2 is available in the DNA data bank of Japan databases under the accession No. DRA006684.

Change history

  • 29 October 2019

    An amendment to this paper has been published and can be accessed via a link at the top of the paper.

References

  1. 1.

    Ferlay, J. et al. Cancer incidence and mortality worldwide: sources, methods and major patterns in GLOBOCAN 2012. Int. J. Cancer 136, E359–E386 (2015).

    CAS  Article  PubMed  PubMed Central  Google Scholar 

  2. 2.

    Siegel, R., Desantis, C. & Jemal, A. Colorectal cancer statistics, 2014. CA Cancer J. Clin. 64, 104–117 (2014).

    PubMed  Article  Google Scholar 

  3. 3.

    Frank, C., Sundquist, J., Yu, H., Hemminki, A. & Hemminki, K. Concordant and discordant familial cancer: familial risks, proportions and population impact. Int. J. Cancer 140, 1510–1516 (2017).

    CAS  PubMed  Article  Google Scholar 

  4. 4.

    Foulkes, W. D. Inherited susceptibility to common cancers. N. Engl. J. Med. 359, 2143–2153 (2008).

    CAS  PubMed  Article  Google Scholar 

  5. 5.

    Johnson, C. M. et al. Meta-analyses of colorectal cancer risk factors. Cancer Causes Control 24, 1207–1222 (2013).

    PubMed  PubMed Central  Article  Google Scholar 

  6. 6.

    Huxley, R. R. et al. The impact of dietary and lifestyle risk factors on risk of colorectal cancer: a quantitative overview of the epidemiological evidence. Int. J. Cancer 125, 171–180 (2009).

    CAS  PubMed  Article  Google Scholar 

  7. 7.

    Schmidt, T. S. B., Raes, J. & Bork, P. The human gut microbiome: from association to modulation. Cell 172, 1198–1215 (2018).

    CAS  PubMed  Article  Google Scholar 

  8. 8.

    Thomas, R. M. & Jobin, C. The microbiome and cancer: is the ‘oncobiome’ mirage real? Trends Cancer Res. 1, 24–35 (2015).

    Article  Google Scholar 

  9. 9.

    Jie, Z. et al. The gut microbiome in atherosclerotic cardiovascular disease. Nat. Commun. 8, 845 (2017).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  10. 10.

    Pasolli, E., Truong, D. T., Malik, F., Waldron, L. & Segata, N. Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS Comput. Biol. 12, e1004977 (2016).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  11. 11.

    Cougnoux, A. et al. Bacterial genotoxin colibactin promotes colon tumour growth by inducing a senescence-associated secretory phenotype. Gut 63, 1932–1942 (2014).

    CAS  PubMed  Article  Google Scholar 

  12. 12.

    Wu, S. et al. A human colonic commensal promotes colon tumorigenesis via activation of T helper type 17 T cell responses. Nat. Med. 15, 1016–1022 (2009).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  13. 13.

    Chung, L. et al. Bacteroides fragilis toxin coordinates a pro-carcinogenic inflammatory cascade via targeting of colonic epithelial cells. Cell Host Microbe 23, 203–214.e5 (2018).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  14. 14.

    Kostic, A. D. et al. Genomic analysis identifies association of Fusobacterium with colorectal carcinoma. Genome Res. 22, 292–298 (2012).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  15. 15.

    Kostic, A. D. et al. Fusobacterium nucleatum potentiates intestinal tumorigenesis and modulates the tumor-immune microenvironment. Cell Host Microbe 14, 207–215 (2013).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  16. 16.

    Rubinstein, M. R. et al. Fusobacterium nucleatum promotes colorectal carcinogenesis by modulating E-cadherin/β-catenin signaling via its FadA adhesin. Cell Host Microbe 14, 195–206 (2013).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  17. 17.

    Yu, J. et al. Metagenomic analysis of faecal microbiome as a tool towards targeted non-invasive biomarkers for colorectal cancer. Gut 66, 70–78 (2017).

    CAS  PubMed  Article  Google Scholar 

  18. 18.

    Feng, Q. et al. Gut microbiome development along the colorectal adenoma–carcinoma sequence. Nat. Commun. 6, 6528 (2015).

    CAS  PubMed  Article  Google Scholar 

  19. 19.

    Zeller, G. et al. Potential of fecal microbiota for early‐stage detection of colorectal cancer. Mol. Syst. Biol. 10, 766 (2014).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  20. 20.

    Vogtmann, E. et al. Colorectal cancer and the human gut microbiome: reproducibility with whole-genome shotgun sequencing. PLoS One 11, e0155362 (2016).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  21. 21.

    Baxter, N. T., Ruffin, M. T., Rogers, M. A. M. & Schloss, P. D. Microbiota-based model improves the sensitivity of fecal immunochemical test for detecting colonic lesions. Genome Med. 8, 37 (2016).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  22. 22.

    Zackular, J. P., Rogers, M. A. M., Ruffin, M. T. 4th & Schloss, P. D. The human gut microbiome as a screening tool for colorectal cancer. Cancer Prev. Res. 7, 1112–1121 (2014).

    CAS  Article  Google Scholar 

  23. 23.

    Drewes, J. L. et al. High-resolution bacterial 16S rRNA gene profile meta-analysis and biofilm status reveal common colorectal cancer consortia. NPJ Biofilms Microbiomes 3, 34 (2017).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  24. 24.

    Pollock, J., Glendinning, L., Wisedchanwet, T. & Watson, M. The madness of microbiome: attempting to find consensus ‘best practice’ for 16S microbiome studies. Appl. Environ. Microbiol. 84, e02627–17 (2018).

    PubMed  PubMed Central  Article  Google Scholar 

  25. 25.

    Segata, N. On the road to strain-resolved comparative metagenomics. mSystems 3, e00190–17 (2018).

    PubMed  PubMed Central  Article  Google Scholar 

  26. 26.

    Truong, D. T., Tett, A., Pasolli, E., Huttenhower, C. & Segata, N. Microbial strain-level population structure and genetic diversity from metagenomes. Genome Res. 27, 626–638 (2017).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  27. 27.

    Pasolli, E. et al. Accessible, curated metagenomic data through ExperimentHub. Nat. Methods 14, 1023 (2017).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  28. 28.

    Dai, Z. et al. Multi-cohort analysis of colorectal cancer metagenome identified altered bacteria across populations and universal bacterial markers. Microbiome 6, 70 (2018).

    PubMed  PubMed Central  Article  Google Scholar 

  29. 29.

    Quince, C., Walker, A. W., Simpson, J. T., Loman, N. J. & Segata, N. Shotgun metagenomics, from sampling to analysis. Nat. Biotechnol. 35, 833–844 (2017).

    CAS  PubMed  Article  Google Scholar 

  30. 30.

    Wirbel, J. et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat. Med. https://doi.org/10.1038/s41591-019-0406-6 (2019).

  31. 31.

    Hannigan, G. D., Duhaime, M. B., Ruffin, M. T. 4th, Koumpouras, C. C. & Schloss, P. D. Diagnostic potential and interactive dynamics of the colorectal cancer virome. MBio 9, e02248–18 (2018).

    PubMed  PubMed Central  Article  Google Scholar 

  32. 32.

    Thomas, A. M. et al. Tissue-associated bacterial alterations in rectal carcinoma patients revealed by 16S rRNA community profiling. Front. Cell. Infect. Microbiol. 6, 179 (2016).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  33. 33.

    Gao, Z., Guo, B., Gao, R., Zhu, Q. & Qin, H. Microbiota disbiosis is associated with colorectal cancer. Front. Microbiol. 6, 20 (2015).

    PubMed  PubMed Central  Google Scholar 

  34. 34.

    Ahn, J. et al. Human gut microbiome and risk for colorectal cancer. J. Natl Cancer Inst. 105, 1907–1911 (2013).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  35. 35.

    Flemer, B. et al. The oral microbiota in colorectal cancer is distinctive and predictive. Gut 67, 1454–1463 (2017).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  36. 36.

    Brito, I. L. et al. Mobile genes in the human microbiome are structured from global to individual scales. Nature 535, 435–439 (2016).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  37. 37.

    Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature 486, 207–214 (2012).

    Article  CAS  Google Scholar 

  38. 38.

    Bonder, M. J. et al. The effect of host genetics on the gut microbiome. Nat. Genet. 48, 1407 (2016).

    CAS  PubMed  Article  Google Scholar 

  39. 39.

    Segata, N. et al. Metagenomic biomarker discovery and explanation. Genome Biol. 12, R60 (2011).

    PubMed  PubMed Central  Article  Google Scholar 

  40. 40.

    Xie, Y.-H. et al. Fecal Clostridium symbiosum for noninvasive detection of early and advanced colorectal cancer: test and validation studies. EBioMedicine 25, 32–40 (2017).

    PubMed  PubMed Central  Article  Google Scholar 

  41. 41.

    Boleij, A., van Gelder, M. M. H. J., Swinkels, D. W. & Tjalsma, H. Clinical Importance of Streptococcus gallolyticus infection among colorectal cancer patients: systematic review and meta-analysis. Clin. Infect. Dis. 53, 870–878 (2011).

    CAS  PubMed  Article  Google Scholar 

  42. 42.

    Fijan, S. Microorganisms with claimed probiotic properties: an overview of recent literature. Int. J. Environ. Res. Public Health 11, 4745–4767 (2014).

    PubMed  PubMed Central  Article  Google Scholar 

  43. 43.

    Apweiler, R. et al. UniProt: the universal protein knowledgebase. Nucleic Acids Res. 32, D115–D119 (2004).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  44. 44.

    Gerner, E. W. & Meyskens, F. L. Jr Polyamines and cancer: old molecules, new understanding. Nat. Rev. Cancer 4, 781–792 (2004).

    CAS  PubMed  Article  Google Scholar 

  45. 45.

    Costea, P. I. et al. Towards standards for human fecal sample processing in metagenomic studies. Nat. Biotechnol. 35, 1069–1076 (2017).

    CAS  PubMed  Article  Google Scholar 

  46. 46.

    Riester, M. et al. Risk prediction for late-stage ovarian cancer by meta-analysis of 1525 patient samples. J. Natl Cancer Inst. 106, dju048 (2014).

    PubMed  PubMed Central  Article  Google Scholar 

  47. 47.

    Kummen, M. et al. Elevated trimethylamine-N-oxide (TMAO) is associated with poor prognosis in primary sclerosing cholangitis patients with normal liver function. United European Gastroenterol. J. 5, 532–541 (2017).

    CAS  PubMed  Article  Google Scholar 

  48. 48.

    Oellgaard, J., Winther, S. A., Hansen, T. S., Rossing, P. & von Scholten, B. J. Trimethylamine N-oxide (TMAO) as a new potential therapeutic target for insulin resistance and cancer. Curr. Pharm. Des. 23, 3699–3712 (2017).

    CAS  PubMed  Article  Google Scholar 

  49. 49.

    Kalnins, G. et al. Structure and function of CutC choline lyase from human microbiota bacterium Klebsiella pneumoniae. J. Biol. Chem. 290, 21732–21740 (2015).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  50. 50.

    Rath, S., Heidrich, B., Pieper, D. H. & Vital, M. Uncovering the trimethylamine-producing bacteria of the human gut microbiota. Microbiome 5, 54 (2017).

    PubMed  PubMed Central  Article  Google Scholar 

  51. 51.

    Pasolli, E. et al. Extensive unexplored human microbiome diversity revealed by over 150,000 genomes from metagenomes spanning age, geography, and lifestyle. Cell 176, 1–14 (2019).

    Article  CAS  Google Scholar 

  52. 52.

    Hothorn, T., Hornik, K., van de Wiel, M. A. & Zeileis, A. A lego system for conditional inference. Am. Stat. 60, 257–263 (2006).

    Article  Google Scholar 

  53. 53.

    Nielsen, H. B. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat. Biotechnol. 32, 822–828 (2014).

    CAS  PubMed  Article  Google Scholar 

  54. 54.

    Karlsson, F. H. et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature 498, 99–103 (2013).

    CAS  PubMed  Article  Google Scholar 

  55. 55.

    Qin, J. et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature 490, 55–60 (2012).

    CAS  PubMed  Article  Google Scholar 

  56. 56.

    Integrative HMP (iHMP) Research Network Consortium. The integrative human microbiome project: dynamic analysis of microbiome-host omics profiles during periods of human health and disease. Cell Host Microbe 16, 276–289 (2014).

    Article  CAS  Google Scholar 

  57. 57.

    Dejea, C. M. et al. Patients with familial adenomatous polyposis harbor colonic biofilms containing tumorigenic bacteria. Science 359, 592–597 (2018).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  58. 58.

    Manichanh, C. et al. Reduced diversity of faecal microbiota in Crohn’s disease revealed by a metagenomic approach. Gut 55, 205–211 (2006).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  59. 59.

    Le Chatelier, E. et al. Richness of human gut microbiome correlates with metabolic markers. Nature 500, 541–546 (2013).

    PubMed  Article  CAS  Google Scholar 

  60. 60.

    Bae, S. et al. Plasma choline metabolites and colorectal cancer risk in the Women’s Health Initiative Observational Study. Cancer Res. 74, 7442–7452 (2014).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  61. 61.

    Xu, R., Wang, Q. & Li, L. A genome-wide systems analysis reveals strong link between colorectal cancer and trimethylamine N-oxide (TMAO), a gut microbial metabolite of dietary meat and fat. BMC Genomics 16(Suppl 7), S4 (2015).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  62. 62.

    Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357 (2012).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  63. 63.

    Truong, D. T. et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat. Methods 12, 902–903 (2015).

    CAS  PubMed  Article  Google Scholar 

  64. 64.

    Franzosa, E. A. et al. Species-level functional profiling of metagenomes and metatranscriptomes. Nat. Methods 15, 962–968 (2018).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  65. 65.

    Breiman, L. Random forests. Mach. Learn. 45, 5–32 (2001).

    Article  Google Scholar 

  66. 66.

    Pedregosa, F. et al. Scikit-learn: machine learning in python. J. Mach. Learn. Res. 12, 2825–2830 (2011).

    Google Scholar 

  67. 67.

    Hastie, T, Tibshirani, R. & Friedman, J. The Elements of Statistical Learning Vol. 1 (Springer, 2009).

  68. 68.

    Mandal, S. et al. Analysis of composition of microbiomes: a novel method for studying microbial composition. Microb. Ecol. Health Dis. 26, 27663 (2015).

    PubMed  Google Scholar 

  69. 69.

    Ritchie, M. E. et al. limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 43, e47 (2015).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  70. 70.

    Altschul, S. F., Gish, W., Miller, W., Myers, E. W. & Lipman, D. J. Basic local alignment search tool. J. Mol. Biol. 215, 403–410 (1990).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  71. 71.

    Seemann, T. Prokka: rapid prokaryotic genome annotation. Bioinformatics 30, 2068–2069 (2014).

    CAS  Article  Google Scholar 

  72. 72.

    Segata, N., Börnigen, D., Morgan, X. C. & Huttenhower, C. PhyloPhlAn is a new method for improved phylogenetic and taxonomic placement of microbes. Nat. Commun. 4, 2304 (2013).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  73. 73.

    Kaminski, J. et al. High-specificity targeted functional profiling in microbial communities with ShortBRED. PLoS Comput. Biol. 11, e1004557 (2015).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  74. 74.

    Edgar, R. C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 26, 2460–2461 (2010).

    CAS  PubMed  Article  Google Scholar 

  75. 75.

    Li, H. et al. The sequence alignment/map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  76. 76.

    Danecek, P. et al. The variant call format and VCFtools. Bioinformatics 27, 2156–2158 (2011).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  77. 77.

    Katoh, K. & Standley, D. M. MAFFT multiple sequence alignment software version 7: improvements in performance and usability. Mol. Biol. Evol. 30, 772–780 (2013).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  78. 78.

    Price, M. N., Dehal, P. S. & Arkin, A. P. FastTree 2–approximately maximum-likelihood trees for large alignments. PLoS ONE 5, e9490 (2010).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  79. 79.

    Stamatakis, A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics 30, 1312–1313 (2014).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

  80. 80.

    Asnicar, F., Weingart, G., Tickle, T. L., Huttenhower, C. & Segata, N. Compact graphical representation of phylogenetic data and metadata with GraPhlAn. PeerJ 3, e1029 (2015).

    PubMed  PubMed Central  Article  CAS  Google Scholar 

  81. 81.

    Livak, K. J. & Schmittgen, T. D. Analysis of relative gene expression data using real-time quantitative PCR and the 2–ΔΔC(T) Method. Methods 25, 402–408 (2001).

    CAS  PubMed  PubMed Central  Article  Google Scholar 

Download references

Acknowledgements

We thank the members of the Segata, Naccarati and Waldron groups for insightful discussions, all the volunteers enrolled in the study, the NGS facility at the University of Trento for performing the metagenomic sequencing, and the HPC facility at the University of Trento for supporting the computational experiments. This work was primarily supported by Lega Italiana per La Lotta contro i Tumori to N.S., F.C. and A.N., and by Fundação de Amparo à Pesquisa do Estado de São Paulo (FAPESP, grant No. 16/23527-2) to A.M.T. This work was also partially supported by the Conselho Nacional de Pesquisa e Desenvolvimento (CNPq, Brazil) to J.C.S. and E.D.-N.; by FAPESP (grant No. 14/26897-0); by Associação Beneficente Alzira Denise Hertzog Silva (ABADHS, Brazil) and PRONON/SIPAR to E.D.-N.; by Coordenação de Aperfeiçoamento de Pessoal de Nível Superior – Brazil (CAPES) (Finance Code No. 001 to J.C.S.; by the Italian Institute for Genomic Medicine (IIGM) and Compagnia di San Paolo Torino to A.N, A.F., B.P. and S.T.; by Fondazione Umberto Veronesi ‘Post-doctoral Fellowship Years 2014, 2015, 2016, 2017 and 2018’ to B.P. and S.T.; by the Grant Agency of the Czech Republic (grant No. 17-16857S) to A.N.; by Fondazione Umberto Veronesi (grant No. FUV-14-SG-GANDINI) to S.G.; by the European Union H2020 Marie Curie grant (No. 707345) to E.P.; by the European Research Council (ERC-STG project MetaPG) to N.S.; by MIUR ‘Futuro in Ricerca’ (grant No. RBFR13EWWI_001) to N.S.; by the People Programme (Marie Curie Actions) of the European Union FP7 and H2020 to N.S.; and by the National Cancer Institute (grant No. U24CA180996) and National Institute of Allergy and Infectious Diseases (grant No. 1R21AI121784-01) of the National Institutes of Health to L.W. B.P. is the recipient of a Fulbright Research Scholarship (year 2018). We acknowledge funding from EMBL, DKFZ, the Huntsman Cancer Foundation, the Intramural Research Program of the National Cancer Institute, ETH Zürich and the following external sources: the European Research Council (CancerBiome, grant No. ERC-2010-AdG_20100317) to P.B.; Microbios (No. ERC-AdG-669830) to P.B.; the Novo Nordisk Foundation (grant No. NNF10CC1016515) to M.A.; the Danish Diabetes Academy supported by the Novo Nordisk Foundation and TARGET research initiative (Danish Strategic Research Council, grant No. 0603-00484B) to M.A.; the Matthias-Lackas Foundation (to C.M.U.); the National Cancer Institute (grant Nos. R01 CA189184, R01 CA207371, U01 CA206110 and P30 CA042014 ll to C.M.U.); the BMBF (the de.NBI network, grant No. 031A537B) to P.B.; the ERA-NET TRANSCAN project (No. 01KT1503) to C.M.U.; and the Helmut Horten Foundation (to S.S.). For Validation Cohort2, funding was provided by grants from the National Cancer Center Research and Development Fund (grant Nos. 25-A-4, 28-A-4 and 29-A-6); Practical Research Project for Rare/Intractable Diseases from the Japan Agency for Medical Research and Development (AMED, grant No. JP18ek0109187); JST (Japan Science and Technology Agency)-PRESTO (grant No. JPMJPR1507); Japan Society for the Promotion of Science KAKENHI (grant Nos. 16J10135, 142558 and 221S0002); Joint Research Project of the Institute of Medical Science; the University of Tokyo; the Takeda Science Foundation; and the Suzuken Memorial Foundation.

Author information

Affiliations

Authors

Contributions

N.S., A.M.T., L.W. and A.N conceived the study. N.S. supervised the study. C.P., S.G., D.S., S.T., A.F., G.G., M.T., B.P, M.R. and A.N. organized the clinical study, recruited patients and collected samples. F. Armanini generated metagenomic data. A.M.T., P.M., F. Asnicar, E.P., M.Z., F.B., N.K. and G.F. collected and analyzed the metagenomic data. A.M.T., P.M., F. Asnicar, E.P., M.Z., G.F., J.W., G.Z. and L.W. performed machine learning and statistical analyses. F. Armanini, S.T., S. Manara, A.T., B.P. and A.N. performed validation experiments. S. Mizutani., H.S., S. Shiba, T.S., S.Y., T.Y., J.W., P.S.-K, C.M.U., H.B., M.A., P.B. and G.Z. provided additional validation data. A.M.T., P.M., L.W. and N.S. designed and produced the figures. A.M.T., P.M. and N.S. wrote the manuscript with contributions from S. Manara, F.C., E.D.-N., J.C.S., M.R., L.W. and A.N. All authors discussed and approved the manuscript.

Corresponding author

Correspondence to Nicola Segata.

Ethics declarations

Competing interests

P.B., G.Z., A.Y.V. and S.S. are named inventors on a patent (EP2955232A1: Method for diagnosing colorectal cancer based on analyzing the gut microbiome). All other authors declare no competing interests as defined by Nature Research, or other interests that might be perceived to influence the results and/or discussion reported in this paper.

Additional information

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Sequencing depths and species richness across CRC datasets

a, Boxplots reporting the total number of reads in each dataset. P values between the carcinoma and control groups were calculated by two-tailed Wilcoxon rank-sum tests. b, Boxplots showing the total number of microbial species per dataset. P values were calculated by two-tailed Wilcoxon rank-sum tests. c, Boxplots showing the total number of microbial species per dataset calculated on metagenomes subsampled in each dataset to the number of reads of the tenth percentile. P values were calculated by two-tailed Wilcoxon rank-sum tests. d, Multivariate analysis of species richness using crude and age-, sex- and BMI-adjusted coefficients obtained from linear models. e, Meta-analysis of crude and adjusted multivariate richness coefficients using a random effects model. Bold lines represent the 95% confidence interval for the random effects model estimate.

Extended Data Fig. 2 Meta-analysis of species diversity and oral species richness in CRC datasets.

a, Boxplots reporting the Shannon species diversity in each dataset. P values between the carcinoma and control groups were calculated by two-tailed Wilcoxon rank-sum tests. b, Boxplots reporting the Shannon species diversity calculated on metagenomes subsampled in each dataset to the number of reads of the tenth percentile. P values were calculated by two-tailed Wilcoxon rank-sum tests. c, Multivariate analysis of species diversity using crude and age-, sex- and BMI-adjusted coefficients obtained from linear models. d, Meta-analysis of crude and adjusted multivariate Shannon diversity coefficients using a random effects model. Bold lines represent the 95% confidence interval for the random effects model estimate. e, Boxplots reporting the total number of oral microbial species per dataset. P values were calculated by two-tailed Wilcoxon rank-sum tests comparing values between controls and carcinomas for each dataset. f, Multivariate analysis of putative oral species richness using crude and age-, sex- and BMI-adjusted coefficients obtained from linear models. g, Meta-analysis of crude and adjusted multivariate putative oral species richness coefficients using a random effects model. Bold lines represent the 95% confidence interval for the random effects model estimate.

Extended Data Fig. 3 Two metagenomic cohorts identify clear but only partially overlapping microbiome signatures associated with CRC.

a,b, Relative abundances (log scale) and effect sizes (estimated using the linear discriminant analysis score in LEfSe) for the significantly different microbial species in CRC samples compared to control samples for Cohort1 (significance assessed by the non-parametric test in LEfSe) (a) and Cohort2 (b). c, Alpha-diversities measured as the total number of species and total number of UniProt90 gene families in each sample for the two cohorts. d, Beta-diversities estimated with the Bray–Curtis dissimilarity metric for intra- and inter-condition comparisons in the two cohorts.

Extended Data Fig. 4 Analysis of F. nucleatum markers and taxonomic meta-analysis of CRC datasets.

a, Percentages of F. nucleatum clade-specific markers (200 in total) in each dataset. P values were obtained by two-tailed Wilcoxon rank-sum tests comparing values between controls and carcinomas for each dataset. b, Multivariate analysis of meta-analysis species-level abundance biomarkers. Crude and age-, sex- and BMI-adjusted coefficients for species associated with disease status in the meta-analysis of standardized mean differences. c, Meta-analysis of CRC datasets using species-level MetaPhlAn2 profiles. Bold lines represent the 95% confidence interval for the random effects model estimate.

Extended Data Fig. 5 Analysis of putative oral species abundances in CRC datasets and gene-family richness across CRC datasets.

a, Effect sizes of the abundances of significant putative oral species identified using a meta-analysis of standardized mean differences and a random effects model. Bold lines represent the 95% confidence interval for the random effects model estimate. b, Total abundance of putative oral species in each gut metagenomic dataset. P values were obtained by two-tailed Wilcoxon rank-sum tests comparing values between controls and carcinomas for each dataset. c, The total number of reads in each sample of each dataset correlates with the total number of gene families identified using HUMANn2. Ellipses represent the 95% confidence level assuming a multivariate t-distribution. d, Distribution of the total number of gene families identified in the samples of each dataset. P values were obtained by two-tailed Wilcoxon rank-sum tests comparing values between controls and carcinomas for each dataset. e, Distribution of the percentages of unmapped reads across datasets for UniProt90 gene families.

Extended Data Fig. 6 Cross-validation, cross-cohort and LODO predictions using pathway abundances, species abundances and species-specific markers.

a, Prediction matrix reporting prediction performances as AUC values obtained using a random forest model on pathway relative abundances. Values on the diagonal refer to 20 times repeated tenfold stratified cross-validations. Off-diagonal values refer to the AUC values obtained by training the classifier on the dataset of the corresponding row and applying it to the dataset of the corresponding column. The LODO row refers to the performances obtained by training the model on pathway abundances using all but the dataset of the corresponding column and applying it to the dataset of the corresponding column. b, Prediction matrix as in a but using MetaPhlAn2 marker presence and absence information. c, Prediction of samples-to-cohort assignments using species-level relative abundances. Only control samples from each dataset are considered. d, Principal coordinate analysis of Bray–Curtis distances computed on MetaPhlAn2 species-level abundances across datasets. Ellipses represent the 95% confidence level assuming a multivariate t-distribution. e, Cross-prediction matrix for the performances of random forest models in predicting adenomas versus CRC conditions. f, Cross-prediction matrix as described in e but on the distinction of adenomas versus controls.

Extended Data Fig. 7 Prediction performances with increasing numbers of external datasets considered in the training model.

a, Prediction performances computed based on MetaPhlAn2 species abundances. The dark yellow line interpolates the median AUC at each number of training datasets considered. b, Prediction performances computed based on HUMANn2 gene-family abundances.

Extended Data Fig. 8 Identification of a minimal number of microbial gene families for CRC detection.

Prediction performances in the LODO settings at increasing numbers of gene families. Each ranking is obtained excluding the testing dataset to avoid overfitting.

Extended Data Fig. 9 Metagenomic analysis of genes involved in the TMA synthesis pathway.

a, ShortBRED analysis of yeaW and caiT gene abundances. Points represent the log of RPKM for each sample and crosses represent average values per group/dataset. b, ShortBRED analysis of cutD gene abundances. Boxplots report the RKPM abundances obtained using ShortBRED for the gene of the activating TMA-lyase enzyme cutD. P values were calculated by two-tailed Wilcoxon rank-sum tests comparing values between controls and carcinomas for each dataset. c, Forest plot showing effect sizes calculated using a meta-analysis of standardized mean differences and a random effects model on cutD RPKM abundances between carcinomas and controls. d, Breadth of coverage of cutC gene sequence clusters across CRC datasets. e, Depth of coverage of cutC gene sequence clusters across CRC datasets.

Extended Data Fig. 10 Cluster analysis of representative cutC sequence variants of samples.

a, Prediction strengths at differing numbers of clusters showing optimum numbers at two and four clusters. b, Tables showing the number of samples for carcinomas, adenomas and controls with breadth of coverage >80% at two different cluster thresholds. P values were calculated using a Fisher t-test, and taxonomy was assigned by BLASTn and the cutC sequence database (criteria of 80% coverage, >97% identity and minimum 2,000 nt alignment length).

Supplementary information

Reporting Summary

Supplementary Tables

Supplementary Tables 1–5

Rights and permissions

Reprints and Permissions

About this article

Verify currency and authenticity via CrossMark

Cite this article

Thomas, A.M., Manghi, P., Asnicar, F. et al. Metagenomic analysis of colorectal cancer datasets identifies cross-cohort microbial diagnostic signatures and a link with choline degradation. Nat Med 25, 667–678 (2019). https://doi.org/10.1038/s41591-019-0405-7

Download citation

Further reading

Search

Quick links

Nature Briefing

Sign up for the Nature Briefing newsletter — what matters in science, free to your inbox daily.

Get the most important science stories of the day, free in your inbox. Sign up for Nature Briefing