Abstract
The gut microbiome has been implicated in various human diseases, though findings across studies have shown considerable variability. In this study, we reanalyzed 6314 publicly available fecal metagenomes from 36 case-control studies on different diseases to investigate microbial diversity and disease-shared signatures. Using a unified analysis pipeline, we observed reduced microbial diversity in many diseases, while some exhibited increased diversity. Significant alterations in microbial communities were detected across most diseases. A meta-analysis identified 277 disease-associated gut species, including numerous opportunistic pathogens enriched in patients and a depletion of beneficial microbes. A random forest classifier based on these signatures achieved high accuracy in distinguishing diseased individuals from controls (AUC = 0.776) and high-risk patients from controls (AUC = 0.825), and it also performed well in external cohorts. These results offer insights into the gut microbiome’s role in common diseases in the Chinese population and will guide personalized disease management strategies.
Introduction
The gut microbiota is currently considered a key factor contributing to the regulation of host health1,2. Generally, the overall structure of the gut microbiota is relatively stable despite acute perturbations because of its plasticity, which allows it to quickly return to its initial composition3. However, when hosts are continuously exposed to various pollutants, stresses and diseases, the composition of the gut microbiota might change (dysbiosis), promoting the selection of more virulent microorganisms and potentially harming host health3. With the advancement of amplification-based and whole-metagenomic sequencing technologies, gut microbiota dysbiosis has been widely reported in many common diseases, including autoimmune disorders4,5,6, cardiometabolic conditions7,8, infectious diseases9, psychiatric disorders10,11, and cancers12,13,14. The altered microbiome likely plays a crucial role in these diseases. The varying microbial changes across different diseases emphasize the diverse roles of the microbiota in health and disease states15,16,17,18. However, due to the lack of unified reference databases, the low accuracy of bacterial species annotation and quantification in high-throughput sequencing datasets, and the highly variable experimental and analytical methods used, the signatures of the gut microbiota in different disease states are often incomparable. Additionally, the precise cause of microbial dysfunction in these diseases is not completely understood. Therefore, uniform methods to characterize the gut microbiota in multiple diseases, especially using publicly available datasets, are necessary to identify the overall pattern of disease-associated microbiota shifts.
Meta-analysis, which combines data from multiple studies, can help avoid biases inherent in individual studies19,20,21,22,23 and has proven effective in signature identification and diagnosis across diseases under large-scale microbiome datasets15,24,25. In this study, we reanalyzed raw sequencing data from publicly available metagenomic datasets, comprising 6314 human fecal samples (3728 patients with 28 disease or unhealthy statuses and 2586 healthy controls) from 36 studies of the Chinese population (Supplementary Table 1), using a unified pipeline. The differences in microbial diversity and compositional structure between cases and controls in each study were illustrated by conducting a comparative analysis. Next, integrated meta-analysis was applied to identify the microbial signatures that are universal to diseases across all studies, and machine learning classifiers were established based on the abundance of microbial signatures to investigate the potential of generic gut microbial features in predicting disease states. Our results can be applied to gut microbiological prediction for disease status and guide future specific interventions in different diseases.
Results
Datasets and overview of the gut microbiome
To investigate the gut microbial signatures across various diseases, we collected publicly available metagenomic data of human fecal samples from 36 case-control studies spanning 13 provinces of China (Supplementary Fig. 1a, b). For each study, samples with abnormal body mass index (BMI) (if metadata was available) or a low amount of data (<10 million reads) were removed. These datasets contained samples of 3728 patients with 28 different disease or unhealthy statuses, including immune (8 diseases from 9 studies), cardiometabolic (7 diseases from 10 studies), infectious (3 diseases from 3 studies), digestive (3 diseases from 3 studies), and psychiatric (2 diseases from 3 studies) diseases, cancers (2 diseases from 3 studies), and 4 other diseases, and their accompanying 2586 healthy controls (Fig. 1a; Supplementary Table 1). Notably, both patients and controls exhibited significant diversity in gender, age, and BMI across these datasets (Supplementary Fig. 1c), providing a comprehensive perspective on the characteristics of the Chinese population. A total of 6314 fecal metagenomes were processed with a unified pipeline, resulting in 50.3 Tbp of high-quality nonhuman metagenomic data for further analysis.
The gut microbial compositions of all samples were profiled using MetaPhlAn 4 (ref. 26). At the phylum level, Bacteroidetes and Firmicutes were the most abundant phyla, with average relative abundances of 48.4% ± 15.7% (ranging from 6.3% to 71.4%) and 42.7% ± 14.1% (ranging from 21.6% to 73.5%), respectively, across the 36 studies (Fig. 1b; Supplementary Fig. 2a). These were followed by Proteobacteria (average relative abundance 4.5% ± 2.4%, ranging from 0.7% to 14.2%) and Actinobacteria (average relative abundance 2.8% ± 3.8%, ranging from 0.2% to 19.0%). All other phyla combined accounted for an average of only 1.2% ± 0.7% of the gut microbiome. The proportions of these phyla varied widely across studies, likely because differences in geography or experimental methods (e.g., sample preparation and DNA extraction) introduce substantial heterogeneity in gut microbial compositions (Supplementary Fig. 2b)27,28. At the genus level, Phocaeicola (average relative abundance 15.6% ± 5.4%), Bacteroides (14.2% ± 5.6%), Prevotella (11.6% ± 7.5%), Faecalibacterium (5.8% ± 3.3%), Alistipes (3.6% ± 2.3%), and Roseburia (2.7% ± 1.1%) were the dominant genera (Fig. 1c); these genera were distributed in a similar fashion across all studies except WanY_2021. Enterotype analysis based on the genus profiles of all samples generated two enterotypes characterized by Bacteroides (75.7% of samples) and Prevotella (24.3% of samples) (Supplementary Fig. 3); this result differed from that in previous studies classifying the human gut microbiome into three enterotypes29,30, probably because their boundaries are indistinguishable in our large-scale datasets. At the species level, several members of Phocaeicola (including P. vulgatus, P. plebeius, and P. dorei), Bacteroides (including B. uniformis and B. stercoris), Prevotella copri, and Faecalibacterium prausnitzii were dominant across all analyzed samples (Fig. 1d).
The gut microbiome is associated with multiple diseases
To illustrate the alteration of the gut microbiome in disease, we conducted a comparative analysis of microbial diversity and compositional structure between cases and controls within each study. Analyses were performed for 40 case-control comparisons, as the disease groups of several studies were subdivided by subtypes (e.g., Crohn’s disease [CD] and ulcerative colitis [UC] for inflammatory bowel disease [IBD]) or severity (e.g., prehypertension and hypertension) (Fig. 1a). First, we found that 11 case-control comparisons showed a significant decrease in species richness (estimated by the number of observed species) in disease groups compared with their corresponding control groups, whereas only 1 case-control comparison showed a significant increase in richness and diversity (Wilcoxon rank-sum test, q < 0.05 after adjustment for sex, age, and BMI; Fig. 2a, b; Supplementary Table 2). Similarly, 12 case-control comparisons exhibited lower species diversity (estimated by the Shannon diversity index) in disease subjects compared to controls, whereas only 2 comparisons showed higher diversity. The most prominent disease was Crohn’s disease, which was associated with decreases of over 10% in both species richness and diversity indexes across two comparisons (HeQ_2017 and WengY_2019.CD). In addition, patients with COVID-19 infection (YeohYK_2021), pulmonary tuberculosis (PT) (HuY_2019), hypertension (LiJ_2017.HTN and YanQ_2017), systemic lupus erythematosus (SLE) (ChenB_2020), liver cirrhosis (LC) (QinN_2014), gout (ChuY_2021), and Graves’ disease (GD) (ZhuQ_2021) also showed decreases of over 10% in species richness and diversity, and one study of ankylosing spondylitis (AS) (ZhouC_2020) showed a decrease of over 20% in species richness and diversity. Conversely, increases in species richness and diversity were found in patients with Parkinson’s disease (PD) (QianY_2020 and MaoL_2021) and atrial fibrillation (AF) (ZuoK_2019).
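The two alpha-diversity measures used above can be illustrated with a minimal Python sketch. It is not the authors' code (their analyses were run in R); the natural-log base for the Shannon index and the example profile are assumptions made here for illustration.

```python
import math

def observed_richness(abundances):
    """Species richness: number of species with non-zero abundance."""
    return sum(1 for a in abundances if a > 0)

def shannon_index(abundances):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over non-zero proportions.

    The natural logarithm is an assumption; the paper does not state the base.
    """
    total = sum(abundances)
    if total <= 0:
        return 0.0
    proportions = [a / total for a in abundances if a > 0]
    return -sum(p * math.log(p) for p in proportions)

# Hypothetical species-level profile (relative abundances in percent).
profile = [40.0, 30.0, 20.0, 10.0, 0.0]
richness = observed_richness(profile)
shannon = shannon_index(profile)
```

Richness counts only presence, whereas the Shannon index also reflects evenness, which is why the two measures can disagree for the same comparison.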
Permutational multivariate analysis of variance (PERMANOVA) of the gut microbial composition within each study showed that in 27 of 40 case-control comparisons, the disease state significantly impacted the overall compositional structure of the gut microbiome (adonis p < 0.05; Fig. 2c). Among these subjects, patients with CD (HeQ_2017 and WengY_2019.CD), polycystic ovary syndrome (QiX_2019), AF (ZuoK_2019), GD (ZhuQ_2021), SLE (ChenB_2020), LC (QinN_2014), PT (HuY_2019), and COVID-19 (YeohYK_2021) had the greatest change in their gut microbiomes. We trained random forest models to classify cases and controls within each disease. The models achieved a considerably high classifiability based on the area under the receiver operating characteristic curve (AUC) > 0.7 in 28 of 40 case-control comparisons and an average AUC of 0.759 across all comparisons (Fig. 2d). Taken together, our results indicated profound changes in the gut microbiome in many different diseases.
Shared gut microbial signatures across diseases
Previous studies have demonstrated that many diseases have some common gut microbiome signatures15. In our dataset, PERMANOVA across all samples revealed that disease status can significantly impact the overall gut microbiome, with a modest effect size (R² = 0.43%, adonis p < 0.001). As a comparison, the individuals’ sex, age, and BMI together explained only 0.26% (adonis p < 0.001) of the gut microbiome variation. These findings suggested the existence of shared gut species signatures for healthy individuals. Considering these findings, we next sought to identify the microbial signatures that are universal to diseases, using an algorithm that combines random effects meta-analysis and phenotype-adjusted MaAsLin 2 across all studies (see Methods). We identified 277 species that differed in relative abundance between diseased individuals and healthy controls (Supplementary Table 3). Of these, 194 species were more abundant in the healthy controls than in the diseased subjects, while 83 species were enriched in the disease group.
Control-enriched and disease-enriched species were markedly separated in their taxonomic distribution at the phylum and genus levels (Fig. 3a). Most of the control-enriched species belonged to the phylum Firmicutes, including members of Clostridium (containing C. fessum, C. leptum, and a variety of unclassified species), Blautia (e.g., B. faecicola, B. glucerasea, B. massiliensis, and B. stercoris), Roseburia (e.g., R. intestinalis, R. faecis, R. hominis, and R. inulinivorans), Faecalibacterium (e.g., F. prausnitzii), and Ruminococcus (e.g., R. bicirculans, R. bromii, R. callidus, and R. lactaris). The remaining control-enriched species included several Bacteroidetes members, such as species of Bacteroides (e.g., B. cellulosilyticus, B. eggerthii, B. faecis, B. finegoldii, B. intestinalis, and B. uniformis) and Alistipes (e.g., A. inops, A. senegalensis, and A. shahii). The reductions in the abundances of many of these species were previously identified by the corresponding studies; these results were also reproduced in our analyses (Fig. 3b). Notably, members of Faecalibacterium, Roseburia, and Blautia are the most important producers of short-chain fatty acids (SCFAs) in the human gut31,32,33; this finding was in agreement with previous speculations, which suggested that reduced SCFA biosynthesis ability is a common characteristic for human diseases34.
Conversely, the disease-enriched species were mainly opportunistic pathogens (Fig. 3a, b). For example, the overgrowth of Streptococcus (containing S. anginosus, S. constellatus, S. gallolyticus, S. gordonii, S. infantis, S. mutans, S. oralis, and S. parasanguinis), Enterocloster (e.g., E. aldensis, E. bolteae, E. citroniae, and E. clostridioformis), Escherichia coli, and Erysipelatoclostridium ramosum have previously been found in many diseases. Members of Fusobacterium (including F. mortiferum, F. nucleatum, F. pseudoperiodonticum, F. ulcerans, and F. varium) and Hungatella hathewayi are typical bacteria associated with colorectal cancer (CRC)35,36. Eggerthella lenta and Flavonifractor plautii were reported to be enriched in end-stage renal disease (ESRD) patients with uremic toxin-producing effects37, and have also been associated with several chronic diseases such as asthma38, multiple sclerosis39, and CRC40. Additionally, several members of Lactobacillus, including L. amylovorus, L. crispatus, L. gasseri, and L. mucosae, were enriched in disease subjects.
At the genus level, we identified 107 genera with significant differences in relative abundance between diseased and healthy subjects, with 73 being more abundant in controls and 34 in patients (Supplementary Fig. 4; Supplementary Table 4). Consistent with the species-level results, several genera involved in SCFA production such as Faecalibacterium, Roseburia, and Butyricicoccus were reduced in diseased subjects compared to healthy controls, whereas some harmful bacteria such as Fusobacterium, Escherichia, and Enterocloster were enriched in patients.
Disease characteristics with respect to universal gut species signatures
Considering the universal gut species signatures, we categorized the diseases into several groups based on the gross relative abundances of control-enriched (proposed as “healthy bacteria”) and disease-enriched species (proposed as “unhealthy bacteria”). First, among 13 of 40 case-control comparisons, the microbiomes of patients were characterized by a significant depletion of healthy bacteria accompanied by an expansion of unhealthy bacteria (Fig. 4). These comparisons involved 10 diseases, including breast cancer, CRC (YuJ_2017), hypertension (YanQ_2017), obesity (ZengQ_2021 and LiuR_2017), type 2 diabetes (T2D) (QinJ_2012), IBD (both CD and UC), GD, SLE, COVID-19, and LC. Second, 7 comparisons involving gout, hypertension (LiJ_2017), prehypertension, AS (ZhouC_2020), PT, autism spectrum disorder (ASD) (WanY_2021), and PD (QianY_2021) were characterized by an isolated depletion of healthy bacteria. For example, the gut microbiota of gout patients showed a marked decrease in the abundance of SCFA producers, but only a few harmful bacteria (e.g., Parabacteroides distasonis and Hungatella effluvii) were enriched. Third, 8 comparisons involving CRC (Yang_2021), AF, atherosclerotic cardiovascular disease (ACVD), irritable bowel syndrome (IBS), PCOS, rheumatoid arthritis (RA), and ESRD featured an isolated enrichment of unhealthy bacteria. An example of these diseases is AF, which caused serious overgrowth of Streptococcus and Enterococcus spp., although the abundance of beneficial bacteria did not decrease. Finally, for the remaining 12 case-control comparisons, no significant changes were observed in the abundances of either healthy or unhealthy bacteria. These comparisons considered almost all psychiatric and neurological diseases (i.e., ASD, schizophrenia, and PD), HIV infection, 3 immune diseases (Behcet’s disease, myasthenia gravis, and Vogt-Koyanagi-Harada disease), and 3 cardiometabolic diseases (carotid atherosclerosis, bone mass loss, and prediabetes/treatment-naive T2D).
Prediction of disease status using gut microbial signatures
To investigate the potential of the universal gut microbial signatures in predicting disease status, we trained a random forest classifier based on the relative abundances of 277 differentially abundant species and tested its performance using a tenfold cross-validation approach. This classifier achieved an AUC of 0.776 (95% confidence interval [CI], 0.764–0.787) in classifying cases and controls across all investigated samples (Fig. 5a). In addition, it achieved an AUC of 0.825 (95% CI, 0.806–0.845) in distinguishing the patients with high-risk diseases from their corresponding control subjects, highlighting the high predictability of the statuses of these patients based on the gut microbial signatures. In terms of importance, some disease-enriched species, including Ligilactobacillus salivarius, Ruminococcus gnavus, Clostridium symbiosum, Streptococcus parasanguinis, Mediterraneibacter glycyrrhizinilyticus, Eggerthella lenta, Fusobacterium mortiferum, Blautia spp. (B. producta and B. hansenii), and Peptostreptococcus stomatis featured the highest scores for the discrimination of patients and healthy controls (Fig. 5b). Random forest analysis at the genus level also generated an AUC of 0.787 (95% CI, 0.775–0.798) for classifying all patients from controls and 0.823 (95% CI, 0.803–0.843) for classifying high-risk patients from controls (Supplementary Fig. 5a). Similar classification results were observed within different disease subtypes (Supplementary Fig. 5b). Interestingly, reducing the set of signatures to the 80 most important species generated AUCs close to those obtained using all signatures for classifying cases versus controls (Supplementary Fig. 5c), suggesting that a minimal set of gut microbial signatures might be explored in the future.
Additionally, analyses using other classifiers based on the least absolute shrinkage and selection operator (LASSO) algorithm did not achieve satisfactory results, with AUCs of 0.735 (95% CI, 0.723–0.747) and 0.705 (95% CI, 0.692–0.717) at the species and genus levels, respectively (Supplementary Fig. 5d).
We generated a random forest-based gut microbiome health index (GMHI) for each metagenomic sample based on the predictive values averaged from ten random forest classifiers (see Methods). The GMHI theoretically ranged from 0 (extreme healthy state) to 1 (extreme disease state), with actual values of 0.655 ± 0.151 (mean ± S.D.) and 0.491 ± 0.141 for all disease and healthy individuals, respectively (Mann–Whitney U test, p ≪ 2.2 × 10⁻¹⁶; Cliff’s delta = 0.571; Fig. 5c). A total of 81.0% (727/898) of metagenome samples with GMHIs < 0.4 were from the healthy group, and 80.6% (2,346/2,912) of metagenome samples with GMHIs > 0.6 were of nonhealthy origin. Moreover, we found moderate negative correlations between the GMHI and gut microbial diversity indexes (Spearman’s ρ = -0.37 for richness and ρ = -0.36 for diversity; Supplementary Fig. 6a). However, considering that the predictive power of microbial diversity was very poor (Supplementary Fig. 6b), we concluded that the GMHI is a more effective indicator of health.
Finally, to validate the reliability of our classifiers, we analyzed the fecal metagenomes from three independent public cohorts: (1) bipolar depression (BD)41, (2) CRC42, and (3) ESRD43. Another newly recruited cohort comprising 234 patients with autoimmune diseases (including 95 RA patients, 73 SLE patients, and 66 SS patients) and 118 healthy controls was also included for validation. Using these cohorts, we quantified the relative abundances of the 277 disease-associated species and compared them between cases and controls. The AUCs of the original random forest classifier on these new cohorts were 0.637 (95% CI, 0.533–0.741), 0.838 (95% CI, 0.736–0.940), and 0.836 (95% CI, 0.786–0.887) for BD, CRC, and ESRD versus controls, respectively (Fig. 5d), suggesting that the classifier may perform better for high-risk diseases like CRC and ESRD, whereas the discrimination for psychiatric diseases was less evident. In autoimmune diseases, the original random forest classifier achieved AUCs of 0.555 (95% CI, 0.459–0.651), 0.638 (95% CI, 0.545–0.732), and 0.717 (95% CI, 0.611–0.823) for RA, SLE, and SS, respectively (Fig. 5e). Consistent with previous results, the low discrimination performance for RA may be related to the relatively smaller changes in the gut microbiome. Overall, these findings suggested that the generalized disease-associated gut microbial signatures identified in this study can accurately classify multiple disease states from a healthy state.
Discussion
The human gut harbors a vast array of microbes that significantly influence host health and disease status. Gut dysbiosis not only affects gastrointestinal diseases (e.g., diarrhea and IBS) but also plays a role in immune-related diseases (e.g., IBD, RA, multiple sclerosis, and allergies), central nervous system conditions (Alzheimer’s disease, autism, and PD), and disorders in host energy metabolism (e.g., obesity, atherosclerosis, and T2D)44,45,46,47,48. In this study, we employed a unified meta-analysis pipeline, starting from raw metagenomic reads of 6,314 human fecal samples from the Chinese population, to conduct a comprehensive analysis of the gut microbiome characteristics of these samples. We found that gut species richness and diversity were significantly decreased in patients with various diseases, corroborating previous studies that suggest decreased microbial diversity may be linked to poor health status in humans49,50,51. Our analysis revealed that the overall microbiome structures are markedly altered in most of the investigated diseases; in particular, the patients with several disease statuses, including IBD, PCOS, COVID-19 infection, SLE, and LC, experienced the most prominent alteration in their gut microbiota. Profound variation of the gut microbiota was also found in previous disease-specific studies52,53,54,55,56. Typically, it is not clear whether the changes in the microbiota are a cause or a consequence of these diseases; the characteristics and causes of microbial flora changes in various diseases are thus worth further exploration.
By pooling massive amounts of metagenomic samples and performing cross-disease meta-analysis and statistical analyses, we identified a set of gut microbial signatures that appear to be universal across diseases. These signatures involved the enrichment of some opportunistic pathogens that are commonly associated with different diseases (e.g., Erysipelatoclostridium ramosum, Ruminococcus gnavus, Peptostreptococcus stomatis, Streptococcus spp., Fusobacterium spp. and Enterocloster spp.) and the depletion of beneficial SCFA producers (e.g., Faecalibacterium, Roseburia, and Blautia spp.). This result is consistent with recent studies indicating that some shared features of the gut microbiota are associated with common health markers15,57. Furthermore, we developed a random forest classifier based on the microbial signatures and validated its reliability in the investigated datasets, external public datasets, and a newly recruited independent cohort of 3 autoimmune diseases. The overall performance of this classifier (optimal AUC, 0.776) was comparable with recent cross-population species-level microbiome studies and higher than the models trained by microbial functional signatures17. Therefore, although technical challenges currently limit the immediate clinical application of gut microbiota-based disease-health classifiers, we suggest that subsequent exploration can lead to their use as indicators of overall health evolution or facilitate targeting gut microbe-based intervention.
Notably, as the fecal metagenomes were collected from 36 different studies, we did not account for all interstudy heterogeneities (e.g., sample storage, DNA extraction and sequencing) among the investigated studies. Additionally, due to the lack of metadata, we were unable to adjust for the influences of confounding factors (e.g., age, BMI, and medication usage of patients) on the gut microbiome when performing within-study comparisons between cases and controls. To minimize these limitations, we (1) used strict criteria for the inclusion and exclusion of studies and samples, (2) reanalyzed all datasets based on a comprehensive database and used a unified pipeline, and (3) used meta-analysis to reduce the possible instability from a single study. These efforts partially accounted for the inconsistencies among different studies and samples, and they also contributed to the development of a standardized methodology for metagenome-based cross-disease analysis of the gut microbiome. Overall, our findings based on a large dataset highlighted that data-sharing efforts can provide broadly applicable findings for gut microbiome studies, demonstrated the strong associations between gut microbial diversity and structure and common human diseases, and provided new materials and prospects for future research.
Methods
Collection of public datasets and description of diseases
We searched for published gut microbiome studies in the PubMed and Google Scholar databases based on exhaustive keywords such as “gut metagenome”, “gut microbiota/microbiome”, and “shotgun sequencing” (as of January 2022). The materials of each study were manually reviewed, and 44 studies were included under the following criteria: (1) a case-control study of disease, (2) samples from Chinese individuals, and (3) availability of fecal metagenomic data. Eight of these studies were removed because they included <50 total samples or the case and control samples were not from the same batch. The raw whole-metagenome shotgun sequencing datasets of the remaining studies were downloaded from the National Center for Biotechnology Information—Sequence Read Archive (NCBI SRA), European Nucleotide Archive (ENA), and China National GeneBank (CNGB) databases. Metadata of the studies was obtained from the original articles and materials, from the NCBI/EBI/CNGB sample information, or by contacting the corresponding authors. Within each study with available metadata, samples were excluded under the following criteria: (1) nonstandard definitions of disease or unhealthy statuses (e.g., hypertension/diabetes were defined according to the latest standards), (2) no baseline samples for longitudinal sampling, and (3) BMI < 17 kg/m2 or >30 kg/m2 (for samples with available phenotypic data, except in the obesity and diabetes studies).
Thirty-six studies, covering 28 different diseases or unhealthy conditions, were included in this study. The most common disease subtypes were autoimmune diseases, which encompassed 8 diseases from 9 studies (including AS, Behcet’s disease, GD, gout, pediatric myasthenia gravis, RA, SLE, and Vogt-Koyanagi-Harada disease), and cardiometabolic diseases, which comprised 7 diseases from 10 studies (including ACVD, AF, bone mass loss, carotid atherosclerosis, hypertension, obesity, and T2D). Additionally, our analyses included infectious diseases (COVID-19 infection, HIV infection, and PT), digestive diseases (CD, IBD, and IBS), psychiatric diseases (ASD and schizophrenia), and cancers (breast cancer and CRC). Other diseases belong to distinct subtypes, including polycystic ovarian syndrome (endocrine disease), ESRD (kidney disease), liver cirrhosis (liver disease), and PD (neurological disease).
Recruitment and metagenomic sequencing of the autoimmune cohort
This study was approved by the ethics committee of Dalian Medical University, and an informed consent agreement was obtained from each participant. Participants were recruited from the Second Affiliated Hospital of Dalian Medical University and the Second Affiliated Hospital of Guizhou University of Traditional Chinese Medicine. Patients with autoimmune diseases were included based on a confirmed diagnosis by a licensed physician, following the 2019 European League Against Rheumatism/American College of Rheumatology (EULAR/ACR) classification criteria58 for RA, SLE, and pSS. Exclusion criteria for participants included the presence of diabetes, severe hypertension, severe obesity or metabolic syndrome, IBD, cancers, and abnormal liver or kidney function. Individuals who had taken antibiotics or probiotic products within the past 4 weeks were also excluded. Based on these criteria, a total of 95 RA patients, 73 SLE patients, 66 pSS patients, and 118 healthy controls were included for further analysis.
Fecal specimens were collected from participants, temporarily stored on dry ice, transported to the laboratory within 24 h, and stored at -80°C for further analyses. DNA was extracted from fecal samples using the TIANamp Stool DNA Kit (TIANGEN, China), and its quality was assessed using the Qubit 2.0. The extracted DNA samples were stored at -80°C until use. The sequencing library was prepared using the NEB Next Ultra DNA Library Prep Kit (NEB, USA) according to the manufacturer’s instructions, with index codes added to each sample. Library quality was confirmed using an Agilent 2100 Bioanalyzer. Clustering of the index-coded samples was performed on a cBot Cluster Generation System using an Illumina PE Cluster Kit (Illumina, USA) as per the manufacturer’s instructions. Following cluster generation, the DNA libraries were sequenced on the Illumina NovaSeq platform, generating 150 bp paired-end reads. Quality control and removal of human contaminants were carried out using the same pipeline as for the publicly available samples.
Gut microbiome profiling of fecal metagenomes
For all samples, the raw metagenomic reads were processed for quality control using fastp59. Low-quality (>45 bases with a quality score < 20 or > 5 ‘N’ bases), low-complexity, and adapter-containing reads were removed. The remaining reads were trimmed at the tails for low quality (<Q20) or ‘N’ bases, and the trimmed reads with length <45 bases were also removed. Contaminating human sequences were eliminated by mapping against the reference human genome (GRCh38) using Bowtie2 (ref. 60). Samples with <10 million high-quality nonhuman reads were removed. Finally, 36 studies spanning a total of 6314 fecal metagenomes were retained for follow-up analysis. Taxonomic profiling of the gut microbiome of all samples was performed using MetaPhlAn 4 (ref. 26). The tool includes a large number of known and uncharacterized species in the human microbiome and clusters them into species-level genome bins (SGBs) for analysis. Its reference database contains ~22,000 known species and 5000 uncharacterized species.
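The read-level filtering thresholds described above can be expressed as a short Python sketch. This only illustrates the stated criteria and does not reproduce fastp or its options. Phred+33 quality encoding, trimming only the 3' tail, and the function name filter_and_trim are assumptions introduced here.

```python
MIN_QUAL = 20            # bases below Q20 count as low quality
MAX_LOW_QUAL_BASES = 45  # reads with more low-quality bases are dropped
MAX_N_BASES = 5          # reads with more 'N' bases are dropped
MIN_LENGTH = 45          # trimmed reads shorter than this are dropped

def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]

def filter_and_trim(seq, qual):
    """Return the trimmed (seq, qual) pair, or None if the read is discarded."""
    scores = phred_scores(qual)
    low_quality = sum(1 for q in scores if q < MIN_QUAL)
    if low_quality > MAX_LOW_QUAL_BASES or seq.upper().count("N") > MAX_N_BASES:
        return None
    # Trim low-quality or 'N' bases from the 3' tail (assumed trimming direction).
    end = len(seq)
    while end > 0 and (scores[end - 1] < MIN_QUAL or seq[end - 1] in "Nn"):
        end -= 1
    if end < MIN_LENGTH:
        return None
    return seq[:end], qual[:end]

# Hypothetical read: 60 high-quality bases followed by a 10-base poor tail.
trimmed = filter_and_trim("A" * 70, "I" * 60 + "#" * 10)
```

Low-complexity filtering, adapter removal, and human-read removal are separate steps that this sketch does not cover.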
Statistical analysis
Statistical analyses were implemented in the R 4.0.1 platform.
Comparison analyses
For case-control comparisons within each dataset, P-values were calculated using the Wilcoxon rank-sum test. The q value, used to control the false discovery rate (FDR) across multiple comparisons, was calculated with the R p.adjust function using the Benjamini-Hochberg method61. Random effects meta-analysis was applied to the gut microbiome datasets following a previously developed algorithm35. We combined the meta-analysis with the multivariate analysis by linear models (MaAsLin 2) algorithm62 to identify disease-associated differential signatures. For the meta-analysis, taxonomic relative abundances were converted to arcsine-square root-transformed proportions, and the escalc function in the R metafor package was used to compute Hedges’ g standardized mean differences, which were pooled into an overall effect size with a random effects model. Linear mixed effects model analysis was conducted using the lmer function of the lmerTest package63, with the study as a random effect, and the significance of models was calculated using the anova function. MaAsLin 2 allowed us to capture the correlations between diseases and microbial taxa by de-confounding the effects of other demographic parameters. A taxon was defined as differentially abundant if it met the following criteria: (1) random effects meta-analysis q < 0.05, and (2) MaAsLin 2 analysis q < 0.05 after accounting for age, sex, and BMI (considering only datasets with available metadata) as confounding factors.
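The statistics named in this subsection can be sketched in plain Python to make the arithmetic concrete. This is an illustration, not the authors' R code: the small-sample correction and variance formula for Hedges' g are the common approximations, the between-study variance uses the DerSimonian-Laird estimator (an assumption; the metafor settings are not stated in the text), and all function names are hypothetical.

```python
import math
from statistics import mean, variance

def asin_sqrt(p):
    """Arcsine-square root transform of a proportion in [0, 1]."""
    return math.asin(math.sqrt(p))

def hedges_g(cases, controls):
    """Bias-corrected standardized mean difference and its approximate variance."""
    n1, n2 = len(cases), len(controls)
    pooled_sd = math.sqrt(((n1 - 1) * variance(cases) +
                           (n2 - 1) * variance(controls)) / (n1 + n2 - 2))
    d = (mean(cases) - mean(controls)) / pooled_sd
    g = (1 - 3 / (4 * (n1 + n2) - 9)) * d
    var_g = (n1 + n2) / (n1 * n2) + g * g / (2 * (n1 + n2))
    return g, var_g

def random_effects_pool(effects, variances):
    """Pool per-study effects (DerSimonian-Laird tau^2, assumed estimator)."""
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (len(effects) - 1)) / c) if c > 0 else 0.0
    w_star = [1 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1 / sum(w_star)), tau2

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted q values, returned in the input order."""
    n = len(pvalues)
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    qvalues = [0.0] * n
    running_min = 1.0
    for steps_from_top, idx in enumerate(order):
        rank = n - steps_from_top  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * n / rank)
        qvalues[idx] = running_min
    return qvalues

# Hypothetical relative abundances (as proportions) of one species in one study.
cases = [asin_sqrt(p) for p in (0.01, 0.02, 0.03)]
controls = [asin_sqrt(p) for p in (0.05, 0.06, 0.07)]
g, var_g = hedges_g(cases, controls)
```

A negative g here means the species is less abundant in cases than in controls; pooling the per-study g values and then adjusting the resulting P-values corresponds to criterion (1) above.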
Multivariate analyses
Clustering analysis of the gut microbiomes was performed on the species-level relative abundance profiles using the Jensen-Shannon divergence and the partitioning around medoids (PAM) clustering algorithm, following the tutorial for gut microbiome enterotype analysis30. The average silhouette width (ASW) was used to determine the optimal number of clusters, and an ASW of <0.3 suggested poor support for the existence of discrete clusters. PERMANOVA was performed with the R vegan package64: the effect size (R2) of disease status on microbiome variation was calculated using the adonis function, and the P-value was generated based on 1000 permutations. Principal coordinate analysis (PCoA) and distance-based redundancy analysis (dbRDA) were performed with the R vegan package based on the Bray-Curtis dissimilarity. The Spearman correlation coefficient and its significance were assessed using the cor.test function.
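The two between-sample measures named above can be written out in a few lines of Python. This is a minimal sketch with hypothetical function names, using the natural logarithm and skipping zero-abundance terms; the enterotype tutorial's handling of zeros may differ. The square root of the Jensen-Shannon divergence is included because it is a true distance metric.

```python
import math


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(x, y))
    den = sum(a + b for a, b in zip(x, y))
    return num / den if den > 0 else 0.0


def _kl(p, q):
    """Kullback-Leibler divergence, skipping terms where p is zero."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)


def jensen_shannon_divergence(p, q):
    """Jensen-Shannon divergence between two profiles (natural log)."""
    sp, sq = sum(p), sum(q)
    p = [a / sp for a in p]  # renormalize to proportions
    q = [b / sq for b in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


def jensen_shannon_distance(p, q):
    """Square root of the divergence."""
    return math.sqrt(jensen_shannon_divergence(p, q))
```

A pairwise matrix of such values over all samples is the input that PAM clustering, PERMANOVA, and PCoA operate on.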
Prediction analyses
Random forest analysis was performed with the R randomForest package (1000 trees), and the performance of the random forest models was assessed with tenfold cross-validation. LASSO analysis was performed following a previously described method42. ROC analysis was implemented with the R pROC package, and the AUC was calculated accordingly. The random forest-based GMHI for each metagenomic sample was calculated from the predicted values averaged over ten random forest classifiers.
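The evaluation step can be sketched in Python as follows. The sketch assumes that predicted probabilities from the classifiers are already available (model training is not shown), computes the AUC through its rank-sum formulation rather than through pROC, and averages per-sample predictions across classifiers. The function names are hypothetical, and how the ten classifiers were obtained is not specified here.

```python
def auc(labels, scores):
    """Area under the ROC curve via the rank-sum formulation.

    labels: 1 for cases, 0 for controls. Tied scores get average ranks.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def mean_prediction(per_model_probs):
    """Average per-sample predicted probabilities across classifiers.

    per_model_probs: one list of per-sample probabilities per classifier.
    """
    n_models = len(per_model_probs)
    return [sum(col) / n_models for col in zip(*per_model_probs)]
```

Under these assumptions, the averaged value per sample plays the role of the index score, and passing those scores to auc together with the case-control labels gives the discrimination measure.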
Data availability
The metadata, gut microbial profiles of all analyzed samples, and statistical scripts are available on GitHub (https://github.com/yexianingyue/GM_common_diseases). The authors declare that all other data supporting the findings of the study are available in the paper and supplementary materials or from the corresponding authors upon request.
References
Ehrlich, S. D. The human gut microbiome impacts health and disease. C. R. Biol. 339, 319–323 (2016).
de Vos, W. M., Tilg, H., Van Hul, M. & Cani, P. D. Gut microbiome and health: mechanistic insights. Gut 71, 1020–1032 (2022).
Candela, M., Biagi, E., Maccaferri, S., Turroni, S. & Brigidi, P. Intestinal microbiota is a plastic factor responding to environmental changes. Trends Microbiol. 20, 385–391 (2012).
De Luca, F. & Shoenfeld, Y. The microbiome in autoimmune diseases. Clin. Exp. Immunol. 195, 74–85 (2019).
Chen, C. et al. Characterizations of the gut bacteriome, mycobiome, and virome in patients with osteoarthritis. Microbiol. Spectr. 11, e01711–e01722 (2023).
Chen, C. et al. Characterizations of the multi-kingdom gut microbiota in Chinese patients with gouty arthritis. BMC Microbiol. 23, 363 (2023).
Witkowski, M., Weeks, T. L. & Hazen, S. L. Gut microbiota and cardiovascular disease. Circ. Res. 127, 553–570 (2020).
Fan, Y. & Pedersen, O. Gut microbiota in human metabolic health and disease. Nat. Rev. Microbiol. 19, 55–71 (2021).
Dhar, D. & Mohanty, A. Gut microbiota and Covid-19- possible link and implications. Virus Res. 285, 198018 (2020).
Jarbrink-Sehgal, E. & Andreasson, A. The gut microbiota and mental health in adults. Curr. Opin. Neurobiol. 62, 102–114 (2020).
Nikolova, V. L. et al. Perturbations in gut microbiota composition in psychiatric disorders: a review and meta-analysis. JAMA Psychiatry 78, 1343–1354 (2021).
Zhu, J. et al. Breast cancer in postmenopausal women is associated with an altered gut metagenome. Microbiome 6, 136 (2018).
Vivarelli, S. et al. Gut microbiota and cancer: from pathogenesis to therapy. Cancers 11, 38 (2019).
Yachida, S. et al. Metagenomic and metabolomic analyses reveal distinct stage-specific phenotypes of the gut microbiota in colorectal cancer. Nat. Med. 25, 968–976 (2019).
Duvallet, C., Gibbons, S. M., Gurry, T., Irizarry, R. A. & Alm, E. J. Meta-analysis of gut microbiome studies identifies disease-specific and shared responses. Nat. Commun. 8, 1784 (2017).
Mancabelli, L. et al. Identification of universal gut microbial biomarkers of common human intestinal diseases by meta-analysis. FEMS Microbiol. Ecol. https://doi.org/10.1093/femsec/fix153 (2017).
Armour, C. R., Nayfach, S., Pollard, K. S. & Sharpton, T. J. A metagenomic meta-analysis reveals functional signatures of health and disease in the human gut microbiome. mSystems 4, e00332–00318 (2019).
Yan, Q. et al. A genomic compendium of cultivated human gut fungi characterizes the gut mycobiome and its relevance to common diseases. Cell 187, 2969–2989.e2924 (2024).
Pasolli, E., Truong, D. T., Malik, F., Waldron, L. & Segata, N. Machine learning meta-analysis of large metagenomic datasets: tools and biological insights. PLoS Comput. Biol. 12, e1004977 (2016).
Gurevitch, J., Koricheva, J., Nakagawa, S. & Stewart, G. Meta-analysis and the science of research synthesis. Nature 555, 175–182 (2018).
Ho, N. T. et al. Meta-analysis of effects of exclusive breastfeeding on infant gut microbiota across populations. Nat. Commun. 9, 4169 (2018).
Jiang, P., Wu, S., Luo, Q., Zhao, X. M. & Chen, W. H. Metagenomic analysis of common intestinal diseases reveals relationships among microbial signatures and powers multidisease diagnostic models. mSystems 6, e00112–21 (2021).
Wu, Q., Badu, S., So, S. Y., Treangen, T. J. & Savidge, T. C. The pan-microbiome profiling system Taxa4Meta identifies clinical dysbiotic features and classifies diarrheal disease. J. Clin. Invest. 134, e170859 (2024).
Su, Q. et al. Faecal microbiome-based machine learning for multi-class disease diagnosis. Nat. Commun. 13, 6818 (2022).
Tierney, B. T. et al. Systematically assessing microbiome-disease associations identifies drivers of inconsistency in metagenomic research. PLoS Biol. 20, e3001556 (2022).
Blanco-Míguez, A. et al. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nat. Biotechnol. 41, 1633–1644 (2023).
Suzuki, T. A. & Worobey, M. Geographical variation of human gut microbial composition. Biol. Lett. 10, 20131037 (2014).
Wagner Mackenzie, B., Waite, D. W. & Taylor, M. W. Evaluating variation in human gut microbiota profiles due to DNA extraction method and inter-subject differences. Front. Microbiol. 6, 130 (2015).
Arumugam, M. et al. Enterotypes of the human gut microbiome. Nature 473, 174–180 (2011).
Costea, P. I. et al. Enterotypes in the landscape of gut microbial community composition. Nat. Microbiol. 3, 8–16 (2018).
Machiels, K. et al. A decrease of the butyrate-producing species Roseburia hominis and Faecalibacterium prausnitzii defines dysbiosis in patients with ulcerative colitis. Gut 63, 1275–1283 (2014).
Morrison, D. J. & Preston, T. Formation of short chain fatty acids by the gut microbiota and their impact on human metabolism. Gut Microbes 7, 189–200 (2016).
Koh, A., De Vadder, F., Kovatcheva-Datchary, P. & Backhed, F. From dietary fiber to host physiology: short-chain fatty acids as key bacterial metabolites. Cell 165, 1332–1345 (2016).
Martin-Gallausiaux, C., Marinelli, L., Blottiere, H. M., Larraufie, P. & Lapaque, N. SCFA: mechanisms and functional importance in the gut. Proc. Nutr. Soc. 80, 37–49 (2021).
Wirbel, J. et al. Meta-analysis of fecal metagenomes reveals global microbial signatures that are specific for colorectal cancer. Nat. Med. 25, 679–689 (2019).
Xia, X. et al. Bacteria pathogens drive host colonic epithelial cell promoter hypermethylation of tumor suppressor genes in colorectal cancer. Microbiome 8, 1–13 (2020).
Wang, X. et al. Aberrant gut microbiota alters host metabolome and impacts renal failure in humans and rodents. Gut 69, 2131–2142 (2020).
Wang, Q. et al. A metagenome-wide association study of gut microbiota in asthma in UK adults. BMC Microbiol. 18, 114 (2018).
Cekanaviciute, E. et al. Gut bacteria from multiple sclerosis patients modulate human T cells and exacerbate symptoms in mouse models. Proc. Natl Acad. Sci. USA 114, 10713–10718 (2017).
Gupta, A. et al. Association of Flavonifractor plautii, a flavonoid-degrading bacterium, with the gut microbiome of colorectal cancer patients in India. mSystems 4, e00438-19 (2019).
Li, Z. et al. Multi-omics analyses of serum metabolome, gut microbiome and brain function reveal dysregulated microbiota-gut-brain axis in bipolar depression. Mol. Psychiatry 27, 4123–4135 (2022).
Chen, F. et al. Meta-analysis of fecal viromes demonstrates high diagnostic potential of the gut viral signatures for colorectal cancer and adenoma risk assessment. J. Adv. Res. 49, 103–114 (2022).
Zhang, P. et al. Metagenome-wide analysis uncovers gut microbial signatures and implicates taxon-specific functions in end-stage renal disease. Genome Biol. 24, 226 (2023).
Carding, S., Verbeke, K., Vipond, D. T., Corfe, B. M. & Owen, L. J. Dysbiosis of the gut microbiota in disease. Microb. Ecol. Health Dis. 26, 26191 (2015).
Wang, J. & Jia, H. Metagenome-wide association studies: fine-mining the microbiome. Nat. Rev. Microbiol. 14, 508–522 (2016).
Young, V. B. The role of the microbiome in human health and disease: an introduction for clinicians. BMJ 356, j831 (2017).
Hills, R. D., Jr. et al. Gut microbiome: profound implications for diet and disease. Nutrients 11, 1613 (2019).
Magne, F. et al. The Firmicutes/Bacteroidetes ratio: a relevant marker of gut dysbiosis in obese patients? Nutrients 12, 1474 (2020).
Mosca, A., Leclerc, M. & Hugot, J. P. Gut microbiota diversity and human diseases: should we reintroduce key predators in our ecosystem? Front. Microbiol. 7, 455 (2016).
Kriss, M., Hazleton, K. Z., Nusbacher, N. M., Martin, C. G. & Lozupone, C. A. Low diversity gut microbiota dysbiosis: drivers, functional implications and recovery. Curr. Opin. Microbiol. 44, 34–40 (2018).
Manor, O. et al. Health and disease markers correlate with gut microbiome composition across thousands of people. Nat. Commun. 11, 1–12 (2020).
Qiu, P. et al. The gut microbiota in inflammatory bowel disease. Front. Cell. Infect. Microbiol. 12, 733992 (2022).
Chu, W. et al. Metagenomic analysis identified microbiome alterations and pathological association between intestinal microbiota and polycystic ovary syndrome. Fertil. Steril. 113, 1286–1298.e1284 (2020).
Zhou, L. et al. Characteristic gut microbiota and predicted metabolic functions in women with PCOS. Endocr. Connect. 9, 63–73 (2020).
Yeoh, Y. K. et al. Gut microbiota composition reflects disease severity and dysfunctional immune responses in patients with COVID-19. Gut 70, 698–706 (2021).
Hu, Y. et al. Gut microbiota associated with pulmonary tuberculosis and dysbiosis caused by anti-tuberculosis drugs. J. Infect. 78, 317–322 (2019).
Jackson, M. A. et al. Gut microbiota associations with common diseases and prescription medications in a population-based cohort. Nat. Commun. 9, 2655 (2018).
Aringer, M. et al. 2019 European league against rheumatism/American college of rheumatology classification criteria for systemic lupus erythematosus. Arthritis Rheumatol. 71, 1400–1412 (2019).
Chen, S., Zhou, Y., Chen, Y. & Gu, J. fastp: an ultra-fast all-in-one FASTQ preprocessor. Bioinformatics 34, i884–i890 (2018).
Langmead, B. & Salzberg, S. L. Fast gapped-read alignment with Bowtie 2. Nat. Methods 9, 357–359 (2012).
Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (Methodol.) 57, 289–300 (1995).
Mallick, H. et al. Multivariable association discovery in population-scale meta-omics studies. PLoS Comput. Biol. 17, e1009442 (2021).
Kuznetsova, A., Brockhoff, P. B. & Christensen, R. H. lmerTest package: tests in linear mixed effects models. J. Stat. Softw. 82, 1–26 (2017).
Dixon, P. VEGAN, a package of R functions for community ecology. J. Veg. Sci. 14, 927–930 (2003).
Acknowledgements
This work was supported by grants from the National Natural Science Foundation of China (81902037, 81503455, and 82370563), and Beijing University of Chinese Medicine (NO.5050071720001 and NO.2180072120049).
Author information
Contributions
S.L., Q.Y., Yue Z., R.G., X.M., and W.S. contributed to the conception and design of the study. Yue Z., L.C., S.F., and R.L. downloaded and processed the publicly available datasets. Yue Z., Yan Z., S.L., R.G., S.S., J.M., H.U. and Q.L. performed the bioinformatics analyses. S.L., S.S., and C.C. wrote the manuscript. S.L. and W.S. helped to draft the manuscript. J.M. and W.Y. contributed meaningful discussions. All authors were involved in preparing the manuscript and contributed to manuscript revision, reading, and approving the submitted version.
Ethics declarations
Competing interests
The authors declare no competing interests.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Sun, W., Zhang, Y., Guo, R. et al. A population-scale analysis of 36 gut microbiome studies reveals universal species signatures for common diseases. npj Biofilms Microbiomes 10, 96 (2024). https://doi.org/10.1038/s41522-024-00567-9