Discovery of bioactive microbial gene products in inflammatory bowel disease

Zhang, Yancong; Bhosle, Amrisha; Bae, Sena; McIver, Lauren J.; Pishchany, Gleb; Accorsi, Emma K.; Thompson, Kelsey N.; Arze, Cesar; Wang, Ya; Subramanian, Ayshwarya; Kearney, Sean M.; Pawluk, April; Plichta, Damian R.; Rahnavard, Ali; Shafquat, Afrah; Xavier, Ramnik J.; Vlamakis, Hera; Garrett, Wendy S.; Krueger, Andy; Huttenhower, Curtis; Franzosa, Eric A.

doi:10.1038/s41586-022-04648-7

Article
Published: 25 May 2022

Nature volume 606, pages 754–760 (2022)

Abstract

Microbial communities and their associated bioactive compounds^1,2,3 are often disrupted in conditions such as the inflammatory bowel diseases (IBD)^4. However, even in well-characterized environments (for example, the human gastrointestinal tract), more than one-third of microbial proteins are uncharacterized and often expected to be bioactive^5,6,7. Here we systematically identified more than 340,000 protein families as potentially bioactive with respect to gut inflammation during IBD, about half of which have not to our knowledge been functionally characterized previously on the basis of homology or experiment. To validate prioritized microbial proteins, we used a combination of metagenomics, metatranscriptomics and metaproteomics to provide evidence of bioactivity for a subset of proteins that are involved in host and microbial cell–cell communication in the microbiome; for example, proteins associated with adherence or invasion processes, and extracellular von Willebrand-like factors. Predictions from high-throughput data were validated using targeted experiments that revealed the differential immunogenicity of prioritized Enterobacteriaceae pilins and the contribution of homologues of von Willebrand factors to the formation of Bacteroides biofilms in a manner dependent on mucin levels. This methodology, which we term MetaWIBELE (workflow to identify novel bioactive elements in the microbiome), is generalizable to other environmental communities and human phenotypes. The prioritized results provide thousands of candidate microbial proteins that are likely to interact with the host immune system in IBD, thus expanding our understanding of potentially bioactive gene products in chronic disease states and offering a rational compendium of possible therapeutic compounds and targets.

**Fig. 1: Many protein families in the IBD microbiome are uncharacterized and can be putatively annotated and prioritized for potential bioactivity.**

**Fig. 2: Prioritization of uncharacterized proteins implicated in potential bioactivity and association with IBD severity.**

**Fig. 3: A combination of known and novel pro-inflammatory Enterobacteriaceae pilins is highly prioritized as enriched in IBD dysbiosis.**

**Fig. 4: IBD-associated uncharacterized VWA-containing exoproteins depleted during inflammation are highly prioritized and suggest diverse mechanisms of maintenance of extracellular homeostasis.**

Data availability

Associated data generated during this study are included in the published Article and its Supplementary Tables. All assembled metagenomic contigs, ORFs, gene families, protein families, functional profiles, taxonomic profiles and prioritized profiles of protein families related to this study are available at http://huttenhower.sph.harvard.edu/metawibele. The raw data for the HMP2 metagenomes, metatranscriptomes and metaproteomes were obtained from the IBDMDB website (https://ibdmdb.org, NCBI BioProject PRJNA398089). Sequence data for the Red Sea metagenomes were obtained from SRA BioProject PRJNA289734. The following public databases were used: UniProt (https://www.uniprot.org), UniRef90 (https://www.uniprot.org/uniref), Pfam (https://pfam.xfam.org), DOMINE (https://manticore.niehs.nih.gov/cgi-bin/Domine), the Expression Atlas (https://www.ebi.ac.uk/gxa), SIFTS (https://www.ebi.ac.uk/pdbe/docs/sifts), the Database of Essential Genes (http://essentialgene.org) and the PDB (https://www.rcsb.org).

Code availability

The open-source MetaWIBELE software is available through http://huttenhower.sph.harvard.edu/metawibele. Manuals and online tutorials describing MetaWIBELE are available at https://github.com/biobakery/metawibele. User support is provided through the bioBakery help forum (https://forum.biobakery.org). Additional software details are provided in the Methods.

References

Cohen, L. J. et al. Commensal bacteria make GPCR ligands that mimic human signalling molecules. Nature 549, 48–53 (2017).
Guo, C. J. et al. Discovery of reactive microbiota-derived metabolites that inhibit host proteases. Cell 168, 517–526 (2017).
Bhattarai, Y. et al. Gut microbiota-produced tryptamine activates an epithelial G-protein-coupled receptor to increase colonic secretion. Cell Host Microbe 23, 775–785 (2018).
Lloyd-Price, J. et al. Multi-omics of the gut microbial ecosystem in inflammatory bowel diseases. Nature 569, 655–662 (2019).
Galperin, M. Y. & Koonin, E. V. ‘Conserved hypothetical’ proteins: prioritization of targets for experimental study. Nucleic Acids Res. 32, 5452–5463 (2004).
Galperin, M. Y. & Koonin, E. V. From complete genome sequence to ‘complete’ understanding? Trends Biotechnol. 28, 398–406 (2010).
Joice, R., Yasuda, K., Shafquat, A., Morgan, X. C. & Huttenhower, C. Determining microbial products and identifying molecular targets in the human microbiome. Cell Metab. 20, 731–741 (2014).
Buffie, C. G. et al. Precision microbiome reconstitution restores bile acid mediated resistance to Clostridium difficile. Nature 517, 205–208 (2015).
Zipperer, A. et al. Human commensals producing a novel antibiotic impair pathogen colonization. Nature 535, 511–516 (2016).
Morgan, X. C. et al. Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 13, R79 (2012).
Suzek, B. E., Huang, H., McGarvey, P., Mazumder, R. & Wu, C. H. UniRef: comprehensive and non-redundant UniProt reference clusters. Bioinformatics 23, 1282–1288 (2007).
Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).
UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Res. 43, D204–D212 (2015).
Konstantinidis, K. T. & Tiedje, J. M. Towards a genome-based taxonomy for prokaryotes. J. Bacteriol. 187, 6258–6264 (2005).
Parks, D. H. et al. A standardized bacterial taxonomy based on genome phylogeny substantially revises the tree of life. Nat. Biotechnol. 36, 996–1004 (2018).
Plaza Oñate, F. et al. MSPminer: abundance-based reconstitution of microbial pan-genomes from shotgun metagenomic data. Bioinformatics 35, 1544–1552 (2019).
Li, J. et al. An integrated catalog of reference genes in the human gut microbiome. Nat. Biotechnol. 32, 834–841 (2014).
Jandhyala, S. M. et al. Role of the normal gut microbiota. World J. Gastroenterol. 21, 8787–8803 (2015).
Zhang, R., Ou, H. Y. & Zhang, C. T. DEG: a database of essential genes. Nucleic Acids Res. 32, D271–D272 (2004).
Sokol, H. et al. Faecalibacterium prausnitzii is an anti-inflammatory commensal bacterium identified by gut microbiota analysis of Crohn disease patients. Proc. Natl Acad. Sci. USA 105, 16731–16736 (2008).
Lopez-Siles, M., Duncan, S. H., Garcia-Gil, L. J. & Martinez-Medina, M. Faecalibacterium prausnitzii: from microbiology to diagnostics and prognostics. ISME J. 11, 841–852 (2017).
Schirmer, M., Garner, A., Vlamakis, H. & Xavier, R. J. Microbial genes and pathways in inflammatory bowel disease. Nat. Rev. Microbiol. 17, 497–511 (2019).
Lewis, J. D. et al. Inflammation, antibiotics, and diet as environmental stressors of the gut microbiome in pediatric Crohn’s disease. Cell Host Microbe 18, 489–500 (2015).
Franzosa, E. A. et al. Gut microbiome structure and metabolic activity in inflammatory bowel disease. Nature Microbiol. 4, 293–305 (2019).
Hall, A. B. et al. A novel Ruminococcus gnavus clade enriched in inflammatory bowel disease patients. Genome Med. 9, 103 (2017).
Hughes, E. R. et al. Microbial respiration and formate oxidation as metabolic signatures of inflammation-associated dysbiosis. Cell Host Microbe 21, 208–219 (2017).
Högbom, M. & Ihalin, R. Functional and structural characteristics of bacterial proteins that bind host cytokines. Virulence 8, 1592–1601 (2017).
Wells, T. J., Tree, J. J., Ulett, G. C. & Schembri, M. A. Autotransporter proteins: novel targets at the bacterial cell surface. FEMS Microbiol. Lett. 274, 163–172 (2007).
Pizarro-Cerdá, J. & Cossart, P. Bacterial adhesion and entry into host cells. Cell 124, 715–727 (2006).
Palmela, C. et al. Adherent-invasive Escherichia coli in inflammatory bowel disease. Gut 67, 574–587 (2018).
Xu, Q. et al. A distinct type of pilus from the human microbiome. Cell 165, 690–703 (2016).
Zhang, Y., Thompson, K. N., Huttenhower, C. & Franzosa, E. A. Statistical approaches for differential expression analysis in metatranscriptomics. Bioinformatics 37, i34–i41 (2021).
Starks, A. M., Froehlich, B. J., Jones, T. N. & Scott, J. R. Assembly of CS1 pili: the role of specific residues of the major pilin, CooA. J. Bacteriol. 188, 231–239 (2006).
Galkin, V. E. et al. The structure of the CS1 pilus of enterotoxigenic Escherichia coli reveals structural polymorphism. J. Bacteriol. 195, 1360–1370 (2013).
Vatanen, T. et al. Variation in microbiome LPS immunogenicity contributes to autoimmunity in humans. Cell 165, 842–853 (2016).
Dalbey, R. E. & Kuhn, A. Protein traffic in Gram-negative bacteria—how exported and secreted proteins find their way. FEMS Microbiol. Rev. 36, 1023–1045 (2012).
Costa, T. R. et al. Secretion systems in Gram-negative bacteria: structural and mechanistic insights. Nat. Rev. Microbiol. 13, 343–359 (2015).
Shipman, J. A., Berleman, J. E. & Salyers, A. A. Characterization of four outer membrane proteins involved in binding starch to the cell surface of Bacteroides thetaiotaomicron. J. Bacteriol. 182, 5365–5372 (2000).
Berman, H. M. et al. The Protein Data Bank. Nucleic Acids Res. 28, 235–242 (2000).
Kelley, L. A., Mezulis, S., Yates, C. M., Wass, M. N. & Sternberg, M. J. The Phyre2 web portal for protein modeling, prediction and analysis. Nat. Protoc. 10, 845–858 (2015).
Dong, R., Pan, S., Peng, Z., Zhang, Y. & Yang, J. mTM-align: a server for fast protein structure database search and multiple protein structure alignment. Nucleic Acids Res. 46, W380–W386 (2018).
Treuner-Lange, A. et al. PilY1 and minor pilins form a complex priming the type IVa pilus in Myxococcus xanthus. Nat. Commun. 11, 5054 (2020).
Co, J. Y. et al. Mucins trigger dispersal of Pseudomonas aeruginosa biofilms. NPJ Biofilms Microbiomes 4, 23 (2018).
Medema, M. H. et al. antiSMASH: rapid identification, annotation and analysis of secondary metabolite biosynthesis gene clusters in bacterial and fungal genome sequences. Nucleic Acids Res. 39, W339–W346 (2011).
Haroon, M. F., Thompson, L. R., Parks, D. H., Hugenholtz, P. & Stingl, U. A catalogue of 136 microbial draft genomes from Red Sea metagenomes. Sci. Data 3, 160050 (2016).

Acknowledgements

This work has been supported in part by a research agreement with Takeda Pharmaceuticals (C.H.) and by NIH NIDDK grants R24DK110499 (C.H., W.S.G. and R.J.X.), P30DK043351 (R.J.X.), the Center for Microbiome Informatics and Therapeutics (R.J.X.), NIH AT009708 (R.J.X.), and DK 127171 (R.J.X.). We especially appreciate the participants in the HMP2 Inflammatory Bowel Disease Multi-omics Database who made this study possible. The computations in this paper were run in part on the FASRC Cannon cluster supported by the FAS Division of Science Research Computing Group at Harvard University.

Author information

These authors jointly supervised this work: Curtis Huttenhower, Eric A. Franzosa

Authors and Affiliations

Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
Yancong Zhang, Amrisha Bhosle, Gleb Pishchany, Kelsey N. Thompson, Ya Wang, Ayshwarya Subramanian, Damian R. Plichta, Ali Rahnavard, Afrah Shafquat, Ramnik J. Xavier, Hera Vlamakis, Wendy S. Garrett, Curtis Huttenhower & Eric A. Franzosa
Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Yancong Zhang, Amrisha Bhosle, Sena Bae, Lauren J. McIver, Emma K. Accorsi, Kelsey N. Thompson, Cesar Arze, Ya Wang, Ayshwarya Subramanian, April Pawluk, Ali Rahnavard, Afrah Shafquat, Curtis Huttenhower & Eric A. Franzosa
Harvard Chan Microbiome in Public Health Center, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Yancong Zhang, Amrisha Bhosle, Sena Bae, Lauren J. McIver, Emma K. Accorsi, Kelsey N. Thompson, Ya Wang, April Pawluk, Wendy S. Garrett, Curtis Huttenhower & Eric A. Franzosa
Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Sena Bae, Wendy S. Garrett & Curtis Huttenhower
Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA, USA
Gleb Pishchany
Center for Communicable Disease Dynamics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
Emma K. Accorsi
Department of Civil and Environmental Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
Sean M. Kearney
Gastrointestinal Unit and Center for the Study of Inflammatory Bowel Disease, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
Ramnik J. Xavier
Center for Microbiome Informatics and Therapeutics, Massachusetts Institute of Technology, Cambridge, MA, USA
Ramnik J. Xavier
Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA
Wendy S. Garrett
Takeda Pharmaceutical Company, Cambridge, MA, USA
Andy Krueger

Contributions

Y.Z., A.K., C.H. and E.A.F. designed the research. Y.Z., A.B., A. Subramanian, A.R., A. Shafquat and E.A.F. performed computational analysis. S.B. performed the experimental validation of pilins. G.P. performed the experiments for validating VWF-containing proteins. Y.Z. and L.J.M. implemented the software. Y.Z. and E.K.A. wrote the tutorial document and tested the software. C.A. and D.R.P. participated in generating the assembly data. Y.Z. and C.H. wrote the manuscript with feedback from the other authors. K.N.T., Y.W., S.M.K., A.P., E.A.F. and all other authors participated in editing the manuscript. R.J.X., H.V., W.S.G. and A.K. participated in interpretation of the primary findings. C.H. and E.A.F. supervised the research. All authors approved the final manuscript.

Corresponding authors

Correspondence to Curtis Huttenhower or Eric A. Franzosa.

Ethics declarations

Competing interests

C.H. is on the scientific advisory board of Seres Therapeutics and Empress Therapeutics. W.S.G. is on the scientific advisory board of Freya Biosciences, Senda Biosciences, Artizan Biosciences and Tenza. The laboratory of W.S.G. receives funding from Merck. R.J.X. is a member of the scientific advisory board of Nestle and Senda Biosciences. A.K. is employed by Takeda, which may gain or lose financially through this publication.

Peer review

Peer review information

Nature thanks Robert Quinn, Paul Wilmes and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data figures and tables

Extended Data Fig. 1 Overview of MetaWIBELE workflow and analysis summary in the HMP2 dataset.

a, MetaWIBELE identifies novel potentially bioactive gene products from microbial communities. MetaWIBELE prioritizes and partially annotates putatively bioactive gene products from shotgun metagenomes, using a combination of primary and secondary sequence properties, ecological distributions, and host or environmental phenotypes. The process begins with single-sample metagenomic assemblies, from which open reading frames are called and clustered into gene families. These are quantified, annotated (MetaWIBELE-characterize), and ranked by likely bioactivity (MetaWIBELE-prioritize). This results in proteins from across a set of communities with potential bioactivity in their environments of origin, annotated with the quantitative sources of this bioactivity evidence and per-family information such as abundance, taxonomic origin, and (when known) putative molecular roles. b, Quantitative characteristics of MetaWIBELE applied to the 1,595 metagenomes in the HMP2. Overall strategy used by MetaWIBELE for protein family construction, annotation, and prioritization, and the associated input data and results when applied to datasets used for identifying microbial gene products with potential bioactivity in HMP2. SC: Strong homology to known characterized proteins, SU: Strong homology to known uncharacterized proteins, RH: Remote homology to known proteins, NH: No homology to known proteins. TM: transmembrane. DDI: domain-domain interaction.

Extended Data Fig. 2 Uncharacterized protein families have comparable abundance distribution and sequence composition to known proteins.

a, Nominally characterized and uncharacterized protein families were distinguished with homology-based search against UniRef90 (release 2019_01). We defined strong homology following the UniRef90 criterion of ≥90% identity and ≥80% coverage, remote homology as identity from 25% to 90% and coverage from 25% to 80%, and non-homologous proteins as those with <25% identity or <25% coverage or no hit to UniRef90 proteins. Here, we use ‘uncharacterized known proteins’ to refer to UniRef90 proteins that do not have any Gene Ontology annotations in UniProt (release 2019_01). Distribution of prevalences and abundances of protein families across the four categories of protein families. b, The fractions of novel proteins (proteins with remote homology or without homology to known proteins) are comparable to known proteins across samples. c, Bray-Curtis dissimilarities over protein family profiles between samples from different participants, samples from the same participant over time, and technical replicates. Variability among novel proteins was more extreme than among known proteins, but less extreme than among known proteins with rare abundance (bottom 50%). Box plot boxes indicate quartiles and whiskers show inner fences. d, Uncharacterized proteins with comparable abundance to known proteins fit a neutral model of microbiome assembly (Methods). ‘Unclassified taxon’ indicates a group of genes which lack taxonomic information but can be binned into the same MSP based on co-abundance information. e–g, Uncharacterized proteins showed similar sequence composition with known proteins. Characterized and uncharacterized proteins had similar distributions of lengths of assembled contigs (e), protein lengths (f) and GC content (g).
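The homology categories defined in panel a can be expressed as a small decision rule. The following is a minimal sketch, not the MetaWIBELE implementation: the function name and arguments are hypothetical, and it assumes that any hit that is neither strong nor non-homologous (for example, high identity with intermediate coverage) counts as remote homology, a case the caption does not spell out.

```python
from typing import Optional


def classify_homology(
    identity: Optional[float],
    coverage: Optional[float],
    has_go_annotation: bool = False,
) -> str:
    """Assign SC, SU, RH or NH from a family's best UniRef90 hit.

    identity and coverage are percentages (0-100); None means no hit.
    has_go_annotation says whether the hit carries any Gene Ontology
    annotation in UniProt, which separates SC from SU.
    """
    # No hit, or a hit below either 25% threshold: no homology.
    if identity is None or coverage is None:
        return "NH"
    if identity < 25 or coverage < 25:
        return "NH"
    # UniRef90 criterion: >=90% identity and >=80% coverage.
    if identity >= 90 and coverage >= 80:
        return "SC" if has_go_annotation else "SU"
    # Assumption: everything in between is treated as remote homology.
    return "RH"


if __name__ == "__main__":
    for hit in [(95, 85, True), (95, 85, False), (60, 50, False), (20, 90, False)]:
        print(hit, classify_homology(*hit))
```

In this sketch, 'novel' proteins in the sense of panel b would be those labelled RH or NH.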

Extended Data Fig. 3 An integrated annotation approach characterizes millions of gut microbial protein families.

a, We enumerated the degree to which annotations based on local homology or secondary structure could be assigned by MetaWIBELE. ‘InterPro signatures’ represents protein signatures in InterPro other than Pfam domains. ‘Interaction’ means domain-domain interactions as predicted by DOMINE. ‘Others’ includes other types of protein subcellular localization (e.g., cytoplasmic, membrane, periplasmic). ‘Unknown function’ represents proteins without any putative biochemical annotations. b, Most of the functional annotations assigned by MetaWIBELE were consistent with those in UniProt when evaluated on characterized proteins. ‘UniProt_unique’ annotations are quite rare, indicating the good sensitivity of MetaWIBELE. Meanwhile, ‘MetaWIBELE_unique’ annotations are also in the minority, which could be a ceiling on false positives, but there are likely to be many false negatives from UniProt as well. c, Each row represents one type of annotation. Each column indicates the number of protein families in the intersection of the corresponding annotation types (indicated with black points). The ‘Unclassified taxonomy’ category represents protein families without taxonomy information. ‘MSPs’ (metagenomic species pangenomes) are built by binning co-abundant genes across metagenomic samples. ‘Domains’ are domain-based annotations including Pfam domains and domain-domain interactions. ‘Host facing’ indicates annotations which are likely to be involved in host-microbial interactions (e.g., signal peptides, transmembrane). ‘InterPro signatures’, ‘Others’ and ‘Unknown function’ are defined in a.

Extended Data Fig. 4 Novel protein families can be taxonomically annotated and greatly expand pangenomes of common gut taxa.

a, Schematic of MetaWIBELE’s guilt-by-association approach for per-protein-family taxonomic annotation leveraging co-abundance profiles (MSPs). If reference sequence annotations are consistent within a group of co-varying proteins, their most-specific shared taxonomy can be transferred to other sequences within the family. b, We validated this novel taxonomic annotation method on a 20% holdout set of known proteins. c, To optimize the parameters, we tested different cut-offs for the fraction of protein families between the most and second-most dominant taxon within MSP using the holdout set in b. Stringent cut-offs (i.e., requiring more consistently classified taxa) reduced the power of taxonomic assignment for more specific levels (e.g., species or genus) but controlled false positives. Lenient cut-offs (i.e., requiring less consistently classified taxa) introduced more spurious assignments with good sensitivity to the assignment of species or genus. This sensitivity-specificity trade-off is best-balanced at our default cut-off value of 0.5. d, Comparison of taxonomic annotations by homology-based and guilt-by-association approaches. e, The top 25 genera with the highest number of newly annotated proteins (Supplementary Table 3). The first row indicates the number of genomes in RefSeq per genus. The second row indicates the mean relative abundance of known (i.e., SC and SU) and novel proteins (RH and NH), in which red dots represent the mean of known proteins and blue dots represent the mean of novel proteins. f, Uncharacterized proteins expanded common gut taxa. Each clade represents one genus. Circle bars show relative abundance of different categories of protein families. g, Similar representative genera with dominant abundance were identified in HMP2 and MetaHIT. The top 50 genera (with highest mean abundance) were selected for plotting. Box plot boxes indicate quartiles and whiskers show inner fences.
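The guilt-by-association idea in panels a and c can be illustrated with a short sketch. This is an illustration under stated assumptions rather than the published algorithm: the function name and data layout are hypothetical, and the cut-off is read here as the minimum gap between the most and second-most dominant taxon, expressed as a fraction of the annotated families in the MSP. The caption does not define the statistic precisely, so other readings are possible.

```python
from collections import Counter
from typing import Optional, Sequence, Tuple

# A lineage is ordered from broad to specific, e.g. (genus, species).
Lineage = Tuple[str, ...]


def msp_consensus_taxon(
    lineages: Sequence[Optional[Lineage]], cutoff: float = 0.5
) -> Optional[Lineage]:
    """Return the most specific taxon that dominates an MSP, or None.

    lineages holds one entry per protein family in the MSP; None marks
    families without reference-based taxonomy.
    """
    annotated = [lin for lin in lineages if lin]
    if not annotated:
        return None
    max_depth = max(len(lin) for lin in annotated)
    # Try the most specific rank first, then back off to broader ranks.
    for depth in range(max_depth, 0, -1):
        counts = Counter(lin[:depth] for lin in annotated if len(lin) >= depth)
        ranked = counts.most_common(2)
        top_taxon, top_n = ranked[0]
        second_n = ranked[1][1] if len(ranked) > 1 else 0
        # Assumed statistic: gap between the top two taxa as a fraction
        # of all annotated families.
        if (top_n - second_n) / len(annotated) >= cutoff:
            return top_taxon
    return None


if __name__ == "__main__":
    msp = (
        [("Bacteroides", "Bacteroides fragilis")] * 6
        + [("Bacteroides", "Bacteroides ovatus")]
        + [None] * 3
    )
    print(msp_consensus_taxon(msp))
```

The returned taxon would then be transferred to the unannotated families in the same MSP. Raising `cutoff` makes species-level calls rarer but more conservative, which mirrors the trade-off described in panel c.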

Extended Data Fig. 5 Essential genes are assigned higher priority scores using MetaWIBELE’s unsupervised approach.

a, Prevalence and abundance of 1.6M protein families from the HMP2 metagenomes. Essential proteins (based on DEG homology, see full list from Supplementary Table 4) were enriched in proteins prioritized by the harmonic mean of these values. b, When assumed to be true positives (i.e., ‘important’ proteins), essential proteins were notably well-predicted by ecological properties. This was true across a range of beta parameter settings: i.e., the relative weight for prevalence versus abundance in the calculation of a unified priority score (higher beta implies more weight assigned to prevalence). c–e, Distributions of prevalence (c), abundance (d) and priority scores (e) are plotted for all proteins and essential genes, respectively. Box plot boxes indicate quartiles, and whiskers show inner fences.
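A beta-weighted harmonic mean of prevalence and abundance, as described in panels a and b, could look like the sketch below. It is written by analogy with the F-beta score and is an assumption about the functional form, not a transcription of the MetaWIBELE code. The function name is hypothetical, and both inputs are assumed to be already scaled to the interval [0, 1] (for example, as percentile ranks).

```python
def priority_score(prevalence: float, abundance: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean of prevalence and abundance.

    Both inputs are assumed to lie in [0, 1]. Larger beta gives more
    weight to prevalence; beta = 1 is the plain harmonic mean.
    """
    if prevalence <= 0 or abundance <= 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * prevalence * abundance / (b2 * abundance + prevalence)


if __name__ == "__main__":
    # A prevalent but low-abundance family scores higher as beta grows.
    for beta in (0.5, 1.0, 2.0):
        print(beta, round(priority_score(0.8, 0.2, beta), 3))
```

A harmonic mean rewards families that are both prevalent and abundant, because a low value on either axis pulls the score down sharply.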

Extended Data Fig. 6 Protein families associated with severe IBD phenotypes.

a, A total of 348,973 protein families were prioritized as potentially bioactive in IBD, with all four categories of homology-based characterization dominated by proteins with decreased differential abundance (DA) during dysbiosis. The integrated priority score is a meta-rank combining both phenotypic significance and effect size of DA with ecological properties (abundance and prevalence). DA p-values are from modified linear models (Methods), and effect sizes are differences between mean log-scaled abundances among phenotypes. Positive values indicate more abundance in ‘cases’ (i.e., the dysbiotic state of Crohn’s disease (CD) or ulcerative colitis (UC)). b, Functional annotations assigned to DA prioritized protein families by global homology (top left) or local structural properties (Methods). c, More protein families were depleted in dysbiosis samples than enriched in dysbiosis samples. The largest source of DA families (n = 1,595 from 130 participants; linear mixed-effects model: adjusted p-value with Benjamini–Hochberg FDR correction < 0.05) corresponded with the differences between dysbiotic and non-dysbiotic samples from individuals with CD, whereas those with UC were less well separated. The effect size was computed as the difference of mean values in the dysbiotic condition compared to the non-dysbiotic condition within each IBD phenotype at the log-transformed scale. d, Highly prioritized protein families of Ruminococcus gnavus were grouped into multiple MSPs. Most such R. gnavus proteins were enriched in the dysbiotic states of IBD and fell into msp_127 and msp_306, whereas a few proteins were depleted in dysbiosis and failed to cluster as MSP members (full list from Supplementary Table 12). e, Highly prioritized protein families of Faecalibacterium prausnitzii were grouped into multiple MSPs and tended to be depleted in dysbiosis (full list from Supplementary Table 13).
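Two quantities used throughout this figure, the Benjamini–Hochberg adjusted p-value and the effect size as a difference of mean log-scaled abundances, can be sketched in a few lines. This is a generic illustration, not the study's linear mixed-effects pipeline: the function names, the base-10 logarithm and the pseudocount are assumptions made for the example.

```python
import math
from typing import List, Sequence


def benjamini_hochberg(pvalues: Sequence[float]) -> List[float]:
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


def log_effect_size(
    dysbiotic: Sequence[float],
    non_dysbiotic: Sequence[float],
    pseudocount: float = 1e-6,
) -> float:
    """Difference of mean log10 abundances (dysbiotic minus non-dysbiotic).

    The pseudocount guards against log(0) and is an assumption here.
    Positive values mean higher abundance in the dysbiotic state.
    """
    def mean_log(values: Sequence[float]) -> float:
        return sum(math.log10(v + pseudocount) for v in values) / len(values)

    return mean_log(dysbiotic) - mean_log(non_dysbiotic)


if __name__ == "__main__":
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
    print(log_effect_size([0.1, 0.1], [0.01, 0.01], pseudocount=0.0))
```

Under this convention, a family would be called differentially abundant when its adjusted p-value falls below 0.05, and the sign of the effect size gives the direction of change.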

Extended Data Fig. 7 Potentially bioactive protein families are validated by metaproteomics.

a, Both known and novel proteins showed metaproteomics (MPX) evidence, though only a small fraction of protein families were detected owing to the relatively low coverage of all metaproteomics in the HMP2. ‘MPX-prevalent’ refers to a set with relatively higher prevalence in MPX samples for more consistent detection, in which we thresholded the mean value of the prevalence of proteins in MPX samples (full list from Supplementary Table 7). b, Among the MPX validated proteins, the fraction of prioritized novel proteins (e.g., RH and NH) was comparable to that of known proteins (e.g., SC and SU). c, The prioritized proteins were significantly enriched in the set of proteins with MPX evidence (two-tailed Fisher’s exact test; adjusted p-value with Benjamini–Hochberg FDR correction < 2.2e-16 for SC and SU, 3.3e-258 for RH, 1.4e-21 for NH). d, e, Protein families profiled by MPX had significantly higher priority scores for both known and novel proteins (GSEA method; FDR-adjusted P = 0.0012 for ‘MPX-prevalence’ regardless of characterization categories in d and stratified in SC, SU, RH in e, and FDR-adjusted P = 0.0051 for RH category in e; Supplementary Table 8). Prioritization distribution of ‘MPX-prevalent’ proteins with different characterization levels are shown in e.
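The enrichment test in panel c is a two-tailed Fisher's exact test on a 2×2 table (prioritized or not, against MPX evidence or not). The sketch below implements the textbook test with the standard library only. The function name is hypothetical and the counts in the usage example are invented for illustration; they are not values from the study.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Rows could be prioritized / not prioritized, and columns
    MPX-detected / not detected.
    """
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d

    def prob(x: int) -> float:
        # Hypergeometric probability of x in the top-left cell.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    # Sum all tables at least as extreme (no more probable) than observed;
    # the small tolerance absorbs floating-point ties.
    total = sum(
        p for p in (prob(x) for x in range(lo, hi + 1)) if p <= p_obs * (1 + 1e-7)
    )
    return min(1.0, total)


if __name__ == "__main__":
    # Hypothetical counts: 8 of 10 prioritized families detected by MPX,
    # versus 1 of 6 non-prioritized families.
    print(fisher_exact_two_sided(8, 2, 1, 5))
```

With one test per characterization category (SC, SU, RH, NH), the resulting p-values would then be adjusted for multiple testing, as the caption indicates.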

Extended Data Fig. 8 Supporting evidence for the bioactivity of Enterobacteriaceae pilus components and VWF homologues.

a, Effect of bacterial co-culture on pilin gene expression in bacterial strains. Expression of a subset of highly prioritized bacterial pilin genes by RT–qPCR is normalized to rpoA. b, Expression of other cytokines in HCT-15 cells after co-culture with pilin-encoding strains (Group 1 and 3) versus non-pilin strains (Group 0) (n = 3 independent experiments for each strain; unpaired two-tailed Student’s t test: *p < 0.05, **p < 0.01, ***p < 0.001, ns, not significant; error bars: SEM). mRNA levels are normalized to a GAPDH reference and mean ± SEM are shown (full list from Supplementary Table 24). ‘Untreated’ group represents baseline expression in HCT-15 without bacterial co-culture. c, Predicted structure of VWF-containing families from Oscillibacter. 3TXA_A from the PDB was identified as the closest homologue to Cluster_148958 (the representative of this group), based on structural rather than sequence similarity (Methods). The comparison of protein structures between Cluster_148958 (regions modelled at >90% accuracy by Phyre2) and chain A of 3TXA is shown.

Extended Data Fig. 9 Quantitative evaluation of MetaWIBELE.

a, b, BGC genes are enriched among proteins prioritized by MetaWIBELE’s unsupervised and supervised approaches (full list from Supplementary Table 30). We quantified BGC genes using MetaWIBELE priority scores generated by the unsupervised approach (a) and the supervised approach (b). c–f, In addition, assembly-based gene quantification from MetaWIBELE agrees well with reference-based quantification from HUMAnN among known proteins. c, MetaWIBELE identified most of the HUMAnN-detected protein families in the HMP2 dataset along with many unique proteins. Abundances assigned to proteins detected by both MetaWIBELE and HUMAnN were highly correlated over samples (Spearman’s correlation, two-tailed p < 2.2e–16) (d), had similar Bray–Curtis dissimilarity profiles between samples (e) and were highly correlated within samples (f). Box plot boxes indicate quartiles and whiskers show inner fences.
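
The two agreement measures named in panels d–f can be sketched in plain Python as follows. This is a generic illustration of Spearman's correlation and Bray–Curtis dissimilarity, not the code used to produce the figure; the function names are assumptions, and the Spearman helper does not handle constant inputs.

```python
def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den else 0.0


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman's rho: Pearson correlation computed on average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```

In a comparison like the one described, `spearman` would be applied to the abundances of one shared protein family across samples (or of all shared families within one sample), and `bray_curtis` to pairs of sample profiles from each quantification method.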

Extended Data Fig. 10 Potentially bioactive microbial protein families from marine ecosystems are prioritized by MetaWIBELE.

a, More than 80% (out of 469,542 in total) of the protein families from Red Sea metagenomes were uncharacterized, and more than 70% were novel proteins (proteins with remote homology or without homology to known proteins), which was on average 25% greater than (generally better-studied) human-associated communities. b, These uncharacterized proteins were abundant across samples, indicating that they are likely to contribute to unknown but important biochemical functions within the ocean ecosystems. c, Further, MetaWIBELE prioritized 334,386 protein families which showed differential abundance (DA) between the epipelagic (EPI) and mesopelagic (MES) layers, still including ~80% uncharacterized protein families. Effect sizes are differences between mean log-scaled abundances among depth layers. Positive values indicate higher abundance in the mesopelagic layer. d, Functional annotations of prioritized protein families for each category were assigned by MetaWIBELE. e, f, Enumeration of the prioritization score and fold enrichment (the ratio of the overlap to the expected overlap) of species and Pfam domains among highly prioritized protein families. The top 30 species and Pfam domains with the largest mean fold enrichment are listed in decreasing order. Effect size is as defined in c (full list from Supplementary Tables 32, 33).
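
The fold enrichment defined in panels e and f (the ratio of the overlap to the expected overlap) can be written out as a short Python sketch. The legend does not state how the expected overlap was obtained; the version below assumes independence between being highly prioritized and carrying a given annotation, and all names are hypothetical.

```python
def fold_enrichment(overlap, n_prioritized, n_annotated, n_total):
    """Observed overlap divided by the overlap expected by chance.

    Assumption of this sketch: the expected overlap is the product of the
    two set sizes divided by the size of the background.
    """
    expected = n_prioritized * n_annotated / n_total
    return overlap / expected


def fold_enrichment_from_sets(prioritized, annotated, background):
    """Same ratio, computed from sets of protein-family identifiers."""
    prioritized = prioritized & background
    annotated = annotated & background
    return fold_enrichment(
        len(prioritized & annotated),
        len(prioritized),
        len(annotated),
        len(background),
    )
```

For example, if 100 of 1,000 families are highly prioritized and 50 carry a given Pfam domain, 5 shared families are expected by chance, so an observed overlap of 20 corresponds to a fold enrichment of 4.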

Supplementary information

Supplementary Information

This file contains Supplementary Methods; Supplementary Discussion; Supplementary Notes 1–4; legends for Supplementary Tables 1–33 and Supplementary References.

Reporting Summary

Supplementary Table 1

Supplementary Tables 2–33

About this article

Cite this article

Zhang, Y., Bhosle, A., Bae, S. et al. Discovery of bioactive microbial gene products in inflammatory bowel disease. Nature 606, 754–760 (2022). https://doi.org/10.1038/s41586-022-04648-7

Received: 16 October 2020
Accepted: 15 March 2022
Published: 25 May 2022
Issue Date: 23 June 2022
DOI: https://doi.org/10.1038/s41586-022-04648-7
