Discovery of bioactive microbial gene products in inflammatory bowel disease


Microbial communities and their associated bioactive compounds (refs. 1–3) are often disrupted in conditions such as the inflammatory bowel diseases (IBD) (ref. 4). However, even in well-characterized environments (for example, the human gastrointestinal tract), more than one-third of microbial proteins are uncharacterized and often expected to be bioactive (refs. 5–7). Here we systematically identified more than 340,000 protein families as potentially bioactive with respect to gut inflammation during IBD, about half of which have not, to our knowledge, been functionally characterized previously on the basis of homology or experiment. To validate prioritized microbial proteins, we used a combination of metagenomics, metatranscriptomics and metaproteomics to provide evidence of bioactivity for a subset of proteins that are involved in host and microbial cell–cell communication in the microbiome; for example, proteins associated with adherence or invasion processes, and extracellular von Willebrand-like factors. Predictions from high-throughput data were validated using targeted experiments that revealed the differential immunogenicity of prioritized Enterobacteriaceae pilins and the contribution of homologues of von Willebrand factors to the formation of Bacteroides biofilms in a manner dependent on mucin levels. This methodology, which we term MetaWIBELE (workflow to identify novel bioactive elements in the microbiome), is generalizable to other environmental communities and human phenotypes. The prioritized results provide thousands of candidate microbial proteins that are likely to interact with the host immune system in IBD, thus expanding our understanding of potentially bioactive gene products in chronic disease states and offering a rational compendium of possible therapeutic compounds and targets.

Fig. 1: Many protein families in the IBD microbiome are uncharacterized and can be putatively annotated and prioritized for potential bioactivity.
Fig. 2: Prioritization of uncharacterized proteins implicated in potential bioactivity and association with IBD severity.
Fig. 3: A combination of known and novel pro-inflammatory Enterobacteriaceae pilins is highly prioritized as enriched in IBD dysbiosis.
Fig. 4: IBD-associated uncharacterized VWA-containing exoproteins depleted during inflammation are highly prioritized and suggest diverse mechanisms of maintenance of extracellular homeostasis.

Data availability

Associated data generated during this study are included in the published Article and its Supplementary Tables. All assembled metagenomic contigs, ORFs, gene families, protein families, functional profiles, taxonomic profiles and prioritized profiles of protein families related to this study are available online. The raw data for the HMP2 metagenomes, metatranscriptomes and metaproteomes were obtained from the IBDMDB website (NCBI BioProject PRJNA398089). Sequence data for the Red Sea metagenomes were obtained from SRA BioProject PRJNA289734. The following public databases were used: UniProt, UniRef90, Pfam, DOMINE, the Expression Atlas, SIFTS, the Database of Essential Genes and the PDB.

Code availability

The open-source MetaWIBELE software is available online, together with manuals and tutorials describing its use. User support is provided through the bioBakery help forum. Additional software details are provided in the Methods.


References

  1. Cohen, L. J. et al. Commensal bacteria make GPCR ligands that mimic human signalling molecules. Nature 549, 48–53 (2017).

  2. Guo, C. J. et al. Discovery of reactive microbiota-derived metabolites that inhibit host proteases. Cell 168, 517–526 (2017).

  3. Bhattarai, Y. et al. Gut microbiota-produced tryptamine activates an epithelial G-protein-coupled receptor to increase colonic secretion. Cell Host Microbe 23, 775–785 (2018).

  4. Lloyd-Price, J. et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655–662 (2019).

  5. Galperin, M. Y. & Koonin, E. V. ‘Conserved hypothetical’ proteins: prioritization of targets for experimental study. Nucleic Acids Res. 32, 5452–5463 (2004).

  6. Galperin, M. Y. & Koonin, E. V. From complete genome sequence to ‘complete’ understanding? Trends Biotechnol. 28, 398–406 (2010).

  7. Joice, R., Yasuda, K., Shafquat, A., Morgan, X. C. & Huttenhower, C. Determining microbial products and identifying molecular targets in the human microbiome. Cell Metab. 20, 731–741 (2014).

  8. Buffie, C. G. et al. Precision microbiome reconstitution restores bile acid mediated resistance to Clostridium difficile. Nature 517, 205–208 (2015).

  9. Zipperer, A. et al. Human commensals producing a novel antibiotic impair pathogen colonization. Nature 535, 511–516 (2016).

  10. Morgan, X. C. et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 13, R79 (2012).

  11. Suzek, B. E., Huang, H., McGarvey, P., Mazumder, R. & Wu, C. H. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282–1288 (2007).

  12. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).

  13. UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).

  14. Konstantinidis, K. T. & Tiedje, J. M. Towards a genome-based taxonomy for prokaryotes. J. Bacteriol. 187, 6258–6264 (2005).

  15. Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004 (2018).

  16. Plaza Oñate, F. et al. MSPminer: abundance-based reconstitution of microbial pan-genomes from shotgun metagenomic data. Bioinformatics 35, 1544–1552 (2019).

  17. Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834–841 (2014).

  18. Jandhyala, S. M. et al. Role of the normal gut microbiota. World J. Gastroenterol. 21, 8787–8803 (2015).

  19. Zhang, R., Ou, H. Y. & Zhang, C. T. DEG: a database of essential genes. Nucleic Acids Res. 32, D271–D272 (2004).

  20. Sokol, H. et al. Faecalibacterium prausnitzii is an anti-inflammatory commensal bacterium identified by gut microbiota analysis of Crohn disease patients. Proc. Natl Acad. Sci. USA 105, 16731–16736 (2008).

  21. Lopez-Siles, M., Duncan, S. H., Garcia-Gil, L. J. & Martinez-Medina, M. Faecalibacterium prausnitzii: from microbiology to diagnostics and prognostics. ISME J. 11, 841–852 (2017).

  22. Schirmer, M., Garner, A., Vlamakis, H. & Xavier, R. J. Microbial genes and pathways in inflammatory bowel disease. Nat. Rev. Microbiol. 17, 497–511 (2019).

  23. Lewis, J. D. et al. Inflammation, antibiotics, and diet as environmental stressors of the gut microbiome in pediatric Crohn’s disease. Cell Host Microbe 18, 489–500 (2015).

  24. Franzosa, E. A. et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nature Microbiol. 4, 293–305 (2019).

  25. Hall, A. B. et al. A novel Ruminococcus gnavus clade enriched in inflammatory bowel disease patients. Genome Med. 9, 103 (2017).

  26. Hughes, E. R. et al. Microbial respiration and formate oxidation as metabolic signatures of inflammation-associated dysbiosis. Cell Host Microbe 21, 208–219 (2017).

  27. Högbom, M. & Ihalin, R. Functional and structural characteristics of bacterial proteins that bind host cytokines. Virulence 8, 1592–1601 (2017).

  28. Wells, T. J., Tree, J. J., Ulett, G. C. & Schembri, M. A. Autotransporter proteins: novel targets at the bacterial cell surface. FEMS Microbiol. Lett. 274, 163–172 (2007).

  29. Pizarro-Cerdá, J. & Cossart, P. Bacterial adhesion and entry into host cells. Cell 124, 715–727 (2006).

  30. Palmela, C. et al. Adherent-invasive Escherichia coli in inflammatory bowel disease. Gut 67, 574–587 (2018).

  31. Xu, Q. et al. A distinct type of pilus from the human microbiome. Cell 165, 690–703 (2016).

  32. Zhang, Y., Thompson, K. N., Huttenhower, C. & Franzosa, E. A. Statistical approaches for differential expression analysis in metatranscriptomics. Bioinformatics 37, i34–i41 (2021).

  33. Starks, A. M., Froehlich, B. J., Jones, T. N. & Scott, J. R. Assembly of CS1 pili: the role of specific residues of the major pilin, CooA. J. Bacteriol. 188, 231–239 (2006).

  34. Galkin, V. E. et al. The structure of the CS1 pilus of enterotoxigenic Escherichia coli reveals structural polymorphism. J. Bacteriol. 195, 1360–1370 (2013).

  35. Vatanen, T. et al. Variation in microbiome LPS immunogenicity contributes to autoimmunity in humans. Cell 165, 842–853 (2016).

  36. Dalbey, R. E. & Kuhn, A. Protein traffic in Gram-negative bacteria—how exported and secreted proteins find their way. FEMS Microbiol. Rev. 36, 1023–1045 (2012).

  37. Costa, T. R. et al. Secretion systems in Gram-negative bacteria: structural and mechanistic insights. Nat. Rev. Microbiol. 13, 343–359 (2015).

  38. Shipman, J. A., Berleman, J. E. & Salyers, A. A. Characterization of four outer membrane proteins involved in binding starch to the cell surface of Bacteroides thetaiotaomicron. J. Bacteriol. 182, 5365–5372 (2000).

  39. Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).

  40. Kelley, L. A., Mezulis, S., Yates, C. M., Wass, M. N. & Sternberg, M. J. The Phyre2 web portal for protein modeling, prediction and analysis. Nat. Protoc. 10, 845–858 (2015).

  41. Dong, R., Pan, S., Peng, Z., Zhang, Y. & Yang, J. mTM-align: a server for fast protein structure database search and multiple protein structure alignment. Nucleic Acids Res. 46, W380–W386 (2018).

  42. Treuner-Lange, A. et al. PilY1 and minor pilins form a complex priming the type IVa pilus in Myxococcus xanthus. Nat. Commun. 11, 5054 (2020).

  43. Co, J. Y. et al. Mucins trigger dispersal of Pseudomonas aeruginosa biofilms. NPJ Biofilms Microbiomes 4, 23 (2018).

  44. Medema, M. H. et al. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 39, W339–W346 (2011).

  45. Haroon, M. F., Thompson, L. R., Parks, D. H., Hugenholtz, P. & Stingl, U. A catalogue of 136 microbial draft genomes from Red Sea metagenomes. Sci. Data 3, 160050 (2016).



Acknowledgements

This work has been supported in part by a research agreement with Takeda Pharmaceuticals (C.H.) and by NIH NIDDK grants R24DK110499 (C.H., W.S.G. and R.J.X.), P30DK043351 (R.J.X.), the Center for Microbiome Informatics and Therapeutics (R.J.X.), NIH AT009708 (R.J.X.), and DK 127171 (R.J.X.). We especially appreciate the participants in the HMP2 Inflammatory Bowel Disease Multi-omics Database who made this study possible. The computations in this paper were run in part on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University.

Author information


Contributions

Y.Z., A.K., C.H. and E.A.F. designed the research. Y.Z., A.B., A. Subramanian, A.R., A. Shafquat and E.A.F. performed computational analysis. S.B. performed the experimental validation of pilins. G.P. performed the experiments for validating VWF-containing proteins. Y.Z. and L.J.M. implemented the software. Y.Z. and E.K.A. wrote the tutorial document and tested the software. C.A. and D.R.P. participated in generating the assembly data. Y.Z. and C.H. wrote the manuscript with feedback from the other authors. K.N.T., Y.W., S.M.K., A.P., E.A.F. and all other authors participated in editing the manuscript. R.J.X., H.V., W.S.G. and A.K. participated in interpretation of the primary findings. C.H. and E.A.F. supervised the research. All authors approved the final manuscript.

Corresponding authors

Correspondence to Curtis Huttenhower or Eric A. Franzosa.

Ethics declarations

Competing interests

C.H. is on the scientific advisory board of Seres Therapeutics and Empress Therapeutics. W.S.G. is on the scientific advisory board of Freya Biosciences, Senda Biosciences, Artizan Biosciences and Tenza. The laboratory of W.S.G. receives funding from Merck. R.J.X. is a member of the scientific advisory board of Nestle and Senda Biosciences. A.K. is employed by Takeda, which may gain or lose financially through this publication.

Peer review

Peer review information

Nature thanks Robert Quinn, Paul Wilmes and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Overview of MetaWIBELE workflow and analysis summary in the HMP2 dataset.

a, MetaWIBELE identifies novel potentially bioactive gene products from microbial communities. MetaWIBELE prioritizes and partially annotates putatively bioactive gene products from shotgun metagenomes, using a combination of primary and secondary sequence properties, ecological distributions, and host or environmental phenotypes. The process begins with single-sample metagenomic assemblies, from which open reading frames are called and clustered into gene families. These are quantified, annotated (MetaWIBELE-characterize), and ranked by likely bioactivity (MetaWIBELE-prioritize). This results in proteins from across a set of communities with potential bioactivity in their environments of origin, annotated with the quantitative sources of this bioactivity evidence and per-family information such as abundance, taxonomic origin, and (when known) putative molecular roles. b, Quantitative characteristics of MetaWIBELE applied to the 1,595 metagenomes in the HMP2. Overall strategy used by MetaWIBELE for protein family construction, annotation, and prioritization, and the associated input data and results when applied to datasets used for identifying microbial gene products with potential bioactivity in HMP2. SC: Strong homology to known characterized proteins, SU: Strong homology to known uncharacterized proteins, RH: Remote homology to known proteins, NH: No homology to known proteins. TM: transmembrane. DDI: domain-domain interaction.

Extended Data Fig. 2 Uncharacterized protein families have comparable abundance distribution and sequence composition to known proteins.

a, Nominally characterized and uncharacterized protein families were distinguished with a homology-based search against UniRef90 (release 2019_01). We defined strong homology following the UniRef90 criterion of ≥90% identity and ≥80% coverage, remote homology as identity from 25% to 90% and coverage from 25% to 80%, and non-homologous proteins as those with <25% identity or <25% coverage or no hit to UniRef90 proteins. Here, we use ‘uncharacterized known proteins’ to refer to UniRef90 proteins that do not have any Gene Ontology annotations in UniProt (release 2019_01). Distribution of prevalences and abundances of protein families across the four categories of protein families. b, The fractions of novel proteins (proteins with remote homology or without homology to known proteins) are comparable to those of known proteins across samples. c, Bray-Curtis dissimilarities over protein family profiles between samples from different participants, samples from the same participant over time, and technical replicates. Variability among novel proteins was more extreme than among known proteins, but less extreme than among known proteins with rare abundance (bottom 50%). Box plot boxes indicate quartiles and whiskers show inner fences. d, Uncharacterized proteins with comparable abundance to known proteins fit a neutral model of microbiome assembly (Methods). ‘Unclassified taxon’ indicates a group of genes which lack taxonomic information but can be binned into the same MSP based on co-abundance information. e–g, Uncharacterized proteins showed similar sequence composition to known proteins. Characterized and uncharacterized proteins had similar distributions of lengths of assembled contigs (e), protein lengths (f) and GC content (g).
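The four homology categories defined in panel a (SC, SU, RH, NH) can be illustrated with a small classifier. This is a minimal sketch and not MetaWIBELE's own code: the function name `classify_family` and its arguments are hypothetical, and because the legend does not say how to treat hits that pass one strong threshold but not the other, the sketch assumes that every hit that is neither strong nor non-homologous counts as remote homology.

```python
from typing import Optional


def classify_family(identity: Optional[float],
                    coverage: Optional[float],
                    has_go_annotation: bool = False) -> str:
    """Return 'SC', 'SU', 'RH' or 'NH' for one protein family.

    identity and coverage are percentages (0-100) for the best UniRef90 hit,
    or None when the family has no hit at all.
    """
    # No hit, or a hit below either 25% threshold: no homology.
    if identity is None or coverage is None or identity < 25 or coverage < 25:
        return "NH"
    # UniRef90 criterion for strong homology: >=90% identity and >=80% coverage.
    if identity >= 90 and coverage >= 80:
        # 'Characterized' here means the matched protein has GO annotations.
        return "SC" if has_go_annotation else "SU"
    # Assumption: all remaining hits are treated as remote homology.
    return "RH"


print(classify_family(95, 85, has_go_annotation=True))
print(classify_family(60, 50))
print(classify_family(None, None))
```

Under these assumptions, "novel" families in the legends correspond to the RH and NH outputs, and "known" families to SC and SU.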

Extended Data Fig. 3 An integrated annotation approach characterizes millions of gut microbial protein families.

a, We enumerated the degree to which annotations based on local homology or secondary structure could be assigned by MetaWIBELE. ‘InterPro signatures’ represents protein signatures in InterPro other than Pfam domains. ‘Interaction’ means domain-domain interactions as predicted by DOMINE. ‘Others’ includes other types of protein subcellular localization (e.g., cytoplasmic, membrane, periplasmic). ‘Unknown function’ represents proteins without any putative biochemical annotations. b, Most of the functional annotations assigned by MetaWIBELE were consistent with those in UniProt when evaluated on characterized proteins. ‘UniProt_unique’ annotations are quite rare, indicating the good sensitivity of MetaWIBELE. Meanwhile, ‘MetaWIBELE_unique’ annotations are also in the minority, which could be a ceiling on false positives, but there are likely to be many false negatives from UniProt as well. c, Each row represents one type of annotation. Each column indicates the number of protein families in the intersection of the corresponding annotation types (indicated with black points). The ‘Unclassified taxonomy’ category represents protein families without taxonomy information. ‘MSPs’ (metagenomic species pangenomes) are built by binning co-abundant genes across metagenomic samples. ‘Domains’ are domain-based annotations including Pfam domains and domain-domain interactions. ‘Host facing’ indicates annotations which are likely to be involved in host-microbial interactions (e.g., signal peptides, transmembrane). ‘InterPro signatures’, ‘Others’ and ‘Unknown function’ are defined in a.

Extended Data Fig. 4 Novel protein families can be taxonomically annotated and greatly expand pangenomes of common gut taxa.

a, Schematic of MetaWIBELE’s guilt-by-association approach for per-protein-family taxonomic annotation leveraging co-abundance profiles (MSPs). If reference sequence annotations are consistent within a group of co-varying proteins, their most-specific shared taxonomy can be transferred to other sequences within the family. b, We validated this novel taxonomic annotation method on a 20% holdout set of known proteins. c, To optimize the parameters, we tested different cut-offs for the fraction of protein families between the most and second-most dominant taxon within MSP using the holdout set in b. Stringent cut-offs (i.e., requiring more consistently classified taxa) reduced the power of taxonomic assignment for more specific levels (e.g., species or genus) but controlled false positives. Lenient cut-offs (i.e., requiring less consistently classified taxa) introduced more spurious assignments with good sensitivity to the assignment of species or genus. This sensitivity-specificity trade-off is best-balanced at our default cut-off value of 0.5. d, Comparison of taxonomic annotations by homology-based and guilt-by-association approaches. e, The top 25 genera with the highest number of newly annotated proteins (Supplementary Table 3). The first row indicates the number of genomes in RefSeq per genus. The second row indicates the mean relative abundance of known (i.e., SC and SU) and novel proteins (RH and NH), in which red dots represent the mean of known proteins and blue dots represent the mean of novel proteins. f, Uncharacterized proteins expanded common gut taxa. Each clade represents one genus. Circle bars show relative abundance of different categories of protein families. g, Similar representative genera with dominant abundance were identified in HMP2 and MetaHIT. The top 50 genera (with highest mean abundance) were selected for plotting. Box plot boxes indicate quartiles and whiskers show inner fences.
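The guilt-by-association idea in panels a and c can be sketched as follows. This is an illustrative reading only: the legend does not give the exact rule behind the cut-off, so the sketch assumes that a taxon is accepted at a given rank when the gap between the fractions of families supporting the most and second-most dominant taxon is at least the cut-off (0.5 by default). The names `msp_taxon` and `RANKS` are hypothetical.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple

# Ranks from least to most specific (an illustrative choice of ranks).
RANKS = ["kingdom", "phylum", "class", "order", "family", "genus", "species"]


def msp_taxon(lineages: List[Dict[str, str]],
              cutoff: float = 0.5) -> Optional[Tuple[str, str]]:
    """Pick the most specific (rank, taxon) that dominates one MSP.

    lineages holds one rank -> name mapping for each family in the MSP
    that already has a homology-based taxonomic annotation.
    """
    for rank in reversed(RANKS):  # try species first, then move up the tree
        names = [lineage[rank] for lineage in lineages if rank in lineage]
        if not names:
            continue
        counts = Counter(names).most_common(2)
        top_name, top_n = counts[0]
        second_n = counts[1][1] if len(counts) > 1 else 0
        # Assumed dominance rule: gap between top and runner-up fractions,
        # measured over all annotated families in the MSP.
        if (top_n - second_n) / len(lineages) >= cutoff:
            return rank, top_name
    return None


example = [
    {"genus": "Bacteroides", "species": "Bacteroides fragilis"},
    {"genus": "Bacteroides", "species": "Bacteroides ovatus"},
    {"genus": "Bacteroides", "species": "Bacteroides fragilis"},
    {"genus": "Bacteroides"},
]
print(msp_taxon(example))
```

In this toy MSP the species-level annotations disagree too much to pass the default cut-off, so the shared genus is what would be transferred to unannotated families in the same MSP. A more lenient cut-off accepts the species, which mirrors the sensitivity versus specificity trade-off described in panel c.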

Extended Data Fig. 5 Essential genes are assigned higher priority scores using MetaWIBELE’s unsupervised approach.

a, Prevalence and abundance of 1.6M protein families from the HMP2 metagenomes. Essential proteins (based on DEG homology, see full list from Supplementary Table 4) were enriched in proteins prioritized by the harmonic mean of these values. b, When assumed to be true positives (i.e., ‘important’ proteins), essential proteins were notably well-predicted by ecological properties. This was true across a range of beta parameter settings: i.e., the relative weight for prevalence versus abundance in the calculation of a unified priority score (higher beta implies more weight assigned to prevalence). c–e, Distributions of prevalence (c), abundance (d) and priority scores (e) are plotted for all proteins and essential genes, respectively. Box plot boxes indicate quartiles, and whiskers show inner fences.
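A beta-weighted harmonic mean of prevalence and abundance, as described in panels a and b, can be sketched in the style of an F-beta score. The exact formula used by MetaWIBELE is not given in this legend, so the form below is an assumption, as is the idea that both inputs are first rescaled to the range 0 to 1 (for example as percentile ranks). The function name `priority_score` is hypothetical.

```python
def priority_score(prevalence: float, abundance: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of two scores that each lie in [0, 1].

    beta = 1 gives the plain harmonic mean; beta > 1 puts more weight on
    prevalence and beta < 1 puts more weight on abundance.
    """
    if prevalence <= 0 or abundance <= 0:
        return 0.0  # a harmonic mean is zero as soon as one input is zero
    b2 = beta * beta
    return (1 + b2) * prevalence * abundance / (b2 * abundance + prevalence)


# A family that is widespread (0.9) but low in abundance (0.1):
print(priority_score(0.9, 0.1, beta=1.0))
print(priority_score(0.9, 0.1, beta=2.0))
```

With this form the score tends to the abundance value as beta approaches zero and to the prevalence value as beta grows, which matches the statement that a higher beta assigns more weight to prevalence.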

Extended Data Fig. 6 Protein families associated with severe IBD phenotypes.

a, A total of 348,973 protein families were prioritized as potentially bioactive in IBD, with all four categories of homology-based characterization dominated by proteins with decreased differential abundance (DA) during dysbiosis. The integrated priority score is a meta-rank combining both phenotypic significance and effect size of DA with ecological properties (abundance and prevalence). DA p-values are from modified linear models (Methods), and effect sizes are differences between mean log-scaled abundances among phenotypes. Positive values indicate more abundance in ‘cases’ (i.e., the dysbiotic state of Crohn’s disease (CD) or ulcerative colitis (UC)). b, Functional annotations assigned to DA prioritized protein families by global homology (top left) or local structural properties (Methods). c, More protein families were depleted in dysbiosis samples than enriched in dysbiosis samples. The largest source of DA families (n = 1,595 from 130 participants; linear mixed-effects model: adjusted p-value with Benjamini–Hochberg FDR correction < 0.05) corresponded with the differences between dysbiotic and non-dysbiotic samples from individuals with CD, whereas those with UC were less well separated. The effect size was computed as the difference of mean values in the dysbiotic condition compared to the non-dysbiotic condition within each IBD phenotype at the log-transformed scale. d, Highly prioritized protein families of Ruminococcus gnavus were grouped into multiple MSPs. Most such R. gnavus proteins were enriched in the dysbiotic states of IBD and fell into msp_127 and msp_306, whereas a few proteins were depleted in dysbiosis and failed to cluster as MSP members (full list from Supplementary Table 12). e, Highly prioritized protein families of Faecalibacterium prausnitzii were grouped into multiple MSPs and tended to be depleted in dysbiosis (full list from Supplementary Table 13).
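Two quantities in this legend are simple enough to sketch directly: the effect size as a difference of mean log-scaled abundances, and the Benjamini–Hochberg adjustment applied to the model p-values. The sketch does not reproduce the linear mixed-effects models themselves. The log base (10) and the pseudocount are assumptions, since the legend only says "log-scaled", and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def effect_size(case: Sequence[float], control: Sequence[float],
                pseudocount: float = 1e-6) -> float:
    """Difference of mean log10 abundance; positive means higher in cases."""
    def mean_log(values: Sequence[float]) -> float:
        # The pseudocount (an assumption) avoids log(0) for absent families.
        return sum(math.log10(v + pseudocount) for v in values) / len(values)
    return mean_log(case) - mean_log(control)


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return BH-adjusted p-values in the same order as the input."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


print(effect_size([10.0, 10.0], [1.0, 1.0], pseudocount=0.0))
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.5]))
```

A family would then be called differentially abundant when its adjusted p-value falls below 0.05, with the sign of the effect size indicating enrichment or depletion in dysbiosis.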

Extended Data Fig. 7 Potentially bioactive protein families are validated by metaproteomics.

a, Both known and novel proteins showed metaproteomics (MPX) evidence, though only a small fraction of protein families were detected owing to the relatively low coverage of all metaproteomics in the HMP2. ‘MPX-prevalent’ refers to a set with relatively higher prevalence in MPX samples for more consistent detection, in which we thresholded the mean value of the prevalence of proteins in MPX samples (full list from Supplementary Table 7). b, Among the MPX-validated proteins, the fraction of prioritized novel proteins (i.e., RH and NH) was comparable to that of known proteins (i.e., SC and SU). c, The prioritized proteins were significantly enriched in the set of proteins with MPX evidence (two-tailed Fisher’s exact test; adjusted p-value with Benjamini–Hochberg FDR correction < 2.2e-16 for SC and SU, 3.3e-258 for RH, 1.4e-21 for NH). d, e, Protein families profiled by MPX had significantly higher priority scores for both known and novel proteins (GSEA method; FDR-adjusted P = 0.0012 for ‘MPX-prevalent’ regardless of characterization categories in d and stratified in SC, SU, RH in e, and FDR-adjusted P = 0.0051 for RH category in e; Supplementary Table 8). Prioritization distribution of ‘MPX-prevalent’ proteins with different characterization levels are shown in e.
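The enrichment test in panel c is a two-tailed Fisher's exact test on a 2x2 table (prioritized or not, with or without MPX evidence). The sketch below computes that p-value from the hypergeometric distribution using only the standard library. It is meant for small illustrative tables; for tables of the size analysed in the study a statistics library would be the practical choice. The function name and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_tailed(a: int, b: int, c: int, d: int) -> float:
    """Two-tailed Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = 0.0
    for x in range(lowest, highest + 1):
        p = prob(x)
        # Sum every table that is no more likely than the observed one;
        # the small tolerance guards against floating-point ties.
        if p <= observed * (1 + 1e-9):
            p_value += p
    return min(1.0, p_value)


# Hypothetical counts: rows = prioritized / not, columns = MPX evidence / none.
print(fisher_exact_two_tailed(8, 2, 1, 5))
```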

Extended Data Fig. 8 Supporting evidence for the bioactivity of Enterobacteriaceae pilus components and VWF homologues.

a, Effect of bacterial co-culture on pilin gene expression in bacterial strains. Expression of a subset of highly prioritized bacterial pilin genes by RT–qPCR is normalized to rpoA. b, Expression of other cytokines in HCT-15 cells after co-culture with pilin-encoding strains (Group 1 and 3) versus non-pilin strains (Group 0) (n = 3 independent experiments for each strain; unpaired two-tailed Student’s t test: *p < 0.05, **p < 0.01, ***p < 0.001, ns, not significant; error bars: SEM). mRNA levels are normalized to a GAPDH reference and mean ± SEM are shown (full list from Supplementary Table 24). ‘Untreated’ group represents baseline expression in HCT-15 without bacterial co-culture. c, Predicted structure of VWF-containing families from Oscillibacter. 3TXA_A from the PDB was identified as the closest homologue to Cluster_148958 (the representative of this group), based on structural rather than sequence similarity (Methods). The comparison of protein structures between Cluster_148958 (regions modelled at >90% accuracy by Phyre2) and chain A of 3TXA is shown.

Extended Data Fig. 9 Quantitative evaluation of MetaWIBELE.

a, b, BGC genes are enriched among proteins prioritized by MetaWIBELE’s unsupervised and supervised approaches (full list from Supplementary Table 30). We quantified BGC genes using MetaWIBELE priority scores generated by the unsupervised approach (a) and the supervised approach (b). c–f, In addition, assembly-based gene quantification from MetaWIBELE agrees well with reference-based quantification from HUMAnN among known proteins. c, MetaWIBELE identified most of the HUMAnN-detected protein families in the HMP2 dataset along with many unique proteins. Abundances assigned to proteins detected by both MetaWIBELE and HUMAnN were highly correlated over samples (Spearman’s correlation, two-tailed p < 2.2e-16) (d), had similar Bray-Curtis dissimilarity profiles between samples (e) and were highly correlated within samples (f). Box plot boxes indicate quartiles and whiskers show inner fences.
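The Bray-Curtis dissimilarity used to compare samples in panel e (and in Extended Data Fig. 2c) has a short closed form: the sum of absolute abundance differences divided by the sum of all abundances in both samples. A minimal sketch, assuming non-negative abundance vectors over the same ordered set of protein families; the function name is hypothetical.

```python
from typing import Sequence


def bray_curtis(u: Sequence[float], v: Sequence[float]) -> float:
    """Bray-Curtis dissimilarity between two non-negative abundance profiles."""
    if len(u) != len(v):
        raise ValueError("profiles must cover the same protein families")
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0  # two empty profiles are treated as identical here
    return sum(abs(a - b) for a, b in zip(u, v)) / total


sample_a = [6.0, 7.0, 4.0]
sample_b = [10.0, 0.0, 6.0]
print(bray_curtis(sample_a, sample_b))
```

The value is 0 for identical profiles and 1 for profiles that share no families, so lower values between technical replicates than between participants are what the comparison in Extended Data Fig. 2c looks for.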

Extended Data Fig. 10 Potentially bioactive microbial protein families from marine ecosystems are prioritized by MetaWIBELE.

a, More than 80% (out of 469,542 in total) of the protein families from Red Sea metagenomes were uncharacterized, and more than 70% were novel proteins (proteins with remote homology or without homology to known proteins), which was on average 25% greater than in (generally better-studied) human-associated communities. b, These uncharacterized proteins were abundant across samples, indicating that they are likely to contribute to unknown but important biochemical functions within the ocean ecosystems. c, Further, MetaWIBELE prioritized 334,386 protein families which showed differential abundance (DA) between the epipelagic (EPI) and mesopelagic (MES) layers, still including ~80% uncharacterized protein families. Effect sizes are differences between mean log-scaled abundances among depth layers. Positive values indicate more abundance in the mesopelagic layer. d, Functional annotations of prioritized protein families for each category were assigned by MetaWIBELE. e, f, Enumeration of the prioritization score and fold enrichment (the ratio of the overlap to the expected overlap) of species and Pfam domains among highly prioritized protein families. The top 30 species and Pfam domains with the largest mean fold enrichment are listed in decreasing order. Effect size is as defined in c (full list from Supplementary Tables 32, 33).
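The fold enrichment in panels e and f is described as the ratio of the observed overlap to the expected overlap. The legend does not say how the expected overlap is obtained, so the sketch below assumes the usual independence baseline: if the two sets were unrelated, the expected overlap would be the product of their sizes divided by the total number of families. All names and the toy sets are hypothetical.

```python
from typing import Set


def fold_enrichment(prioritized: Set[str], annotated: Set[str],
                    universe_size: int) -> float:
    """Observed overlap divided by the overlap expected under independence."""
    overlap = len(prioritized & annotated)
    # Assumed baseline: |prioritized| * |annotated| / |all families|.
    expected = len(prioritized) * len(annotated) / universe_size
    if expected == 0:
        raise ValueError("expected overlap is zero; enrichment is undefined")
    return overlap / expected


highly_prioritized = {"f1", "f2", "f3", "f4"}
with_pfam_domain = {"f3", "f4", "f5"}
print(fold_enrichment(highly_prioritized, with_pfam_domain, universe_size=12))
```

Values above 1 indicate that a species or Pfam domain appears among highly prioritized families more often than its overall frequency would suggest.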

Supplementary information

Supplementary Information

This file contains Supplementary Methods; Supplementary Discussion; Supplementary Notes 1–4; legends for Supplementary Tables 1–33 and Supplementary References.

Reporting Summary

Supplementary Table 1

Supplementary Tables 2–33

About this article

Cite this article

Zhang, Y., Bhosle, A., Bae, S. et al. Discovery of bioactive microbial gene products in inflammatory bowel disease. Nature 606, 754–760 (2022).
