Introduction

The gut microbiota is currently considered a key factor contributing to the regulation of host health1,2. Generally, the overall structure of the gut microbiota is relatively stable despite acute perturbations because of its plasticity, which allows it to quickly return to its initial composition3. However, when hosts are continuously exposed to various pollutants, stresses and diseases, the composition of the gut microbiota might change (dysbiosis), promoting the selection of more virulent microorganisms and potentially harming host health3. With the advancement of amplification-based and whole-metagenomic sequencing technologies, gut microbiota dysbiosis has been widely reported in many common diseases, including autoimmune disorders4,5,6, cardiometabolic conditions7,8, infectious diseases9, psychiatric disorders10,11, and cancers12,13,14. The altered microbiome likely plays a crucial role in these diseases. The varying microbial changes across different diseases emphasize the diverse roles of the microbiota in health and disease states15,16,17,18. However, due to the lack of unified reference databases, the low accuracy of bacterial species annotation and quantification in high-throughput sequencing datasets, and the highly variable experimental and analytical methods used, the signatures of the gut microbiota in different disease states are often incomparable. Additionally, the precise cause of microbial dysfunction in these diseases is not completely understood. Therefore, uniform methods to characterize the gut microbiota in multiple diseases, especially using publicly available datasets, are necessary to identify the overall pattern of disease-associated microbiota shifts.

Meta-analysis, which combines data from multiple studies, can help avoid biases inherent in individual studies19,20,21,22,23 and has proven effective for signature identification and diagnosis across diseases in large-scale microbiome datasets15,24,25. In this study, we reanalyzed raw sequencing data from publicly available metagenomic datasets, comprising 6314 human fecal samples (3728 patients with 28 disease or unhealthy statuses and 2586 healthy controls) from 36 studies of the Chinese population (Supplementary Table 1), using a unified pipeline. We first compared microbial diversity and compositional structure between cases and controls within each study. Next, integrated meta-analysis was applied to identify the microbial signatures that are universal to diseases across all studies, and machine learning classifiers were established based on the abundance of microbial signatures to investigate the potential of generic gut microbial features in predicting disease states. Our results can be applied to gut microbiome-based prediction of disease status and can guide future specific interventions in different diseases.

Results

Datasets and overview of the gut microbiome

To investigate the gut microbial signatures across various diseases, we collected publicly available metagenomic data of human fecal samples from 36 case-control studies spanning 13 provinces of China (Supplementary Fig. 1a, b). For each study, samples with abnormal body mass index (BMI) (if metadata was available) or a low amount of data (<10 million reads) were removed. These datasets contained samples of 3728 patients with 28 different disease or unhealthy statuses, including immune (8 diseases from 9 studies), cardiometabolic (7 diseases from 10 studies), infectious (3 diseases from 3 studies), digestive (3 diseases from 3 studies), and psychiatric (2 diseases from 3 studies) diseases, cancers (2 diseases from 3 studies), and 4 other diseases, and their accompanying 2586 healthy controls (Fig. 1a; Supplementary Table 1). Notably, both patients and controls exhibited significant diversity in gender, age, and BMI across these datasets (Supplementary Fig. 1c), providing a comprehensive perspective on the characteristics of the Chinese population. A total of 6314 fecal metagenomes were processed with a unified pipeline, resulting in 50.3 Tbp of high-quality nonhuman metagenomic data for further analysis.

Fig. 1: Taxonomic profiles of the human gut microbiome.

a Bar plot showing the number of diseased and control individuals for each study. b–d Taxonomic compositions of the gut microbiota at the phylum (b), genus (c), and species (d) levels. The bars indicate the average relative abundances of each taxonomy and are colored by their taxonomic assignments. BC breast cancer, CRC colorectal cancer, yCRC young-onset colorectal cancer, oCRC old-onset colorectal cancer, ACVD atherosclerotic cardiovascular disease, AF atrial fibrillation, BL bone mass loss, CA carotid atherosclerosis, HTN hypertension, pHTN prehypertension, MUO metabolically unhealthy obese, OB obesity, T2D type 2 diabetes, pT2D prediabetes, tnT2D treatment-naive type 2 diabetes, CD Crohn’s disease, UC ulcerative colitis, IBS irritable bowel syndrome, PCOS polycystic ovarian syndrome, AS ankylosing spondylitis, BD Behcet’s disease, GD Graves’ disease, MG myasthenia gravis, RA rheumatoid arthritis, SLE systemic lupus erythematosus, VKH Vogt-Koyanagi-Harada disease, PT pulmonary tuberculosis, LC liver cirrhosis, ASD autism spectrum disorder, SCZ schizophrenia, PD Parkinson’s disease, ESRD end-stage renal disease. Disease groups of four studies were subdivided by subtypes (i.e., yCRC and oCRC for study YangY_2021, CD and UC for study WengY_2019) or severity (i.e., pHTN and HTN for LiJ_2017, pT2D and tnT2D for ZhongH_2019).

The gut microbial compositions of all samples were profiled using MetaPhlAn 4 (ref. 26). At the phylum level, Bacteroidetes and Firmicutes were the most abundant phyla, with average relative abundances of 48.4% ± 15.7% (ranging from 6.3% to 71.4%) and 42.7% ± 14.1% (ranging from 21.6% to 73.5%), respectively, across the 36 studies (Fig. 1b; Supplementary Fig. 2a). These were followed by Proteobacteria (average relative abundance 4.5% ± 2.4%, ranging from 0.7% to 14.2%) and Actinobacteria (average relative abundance 2.8% ± 3.8%, ranging from 0.2% to 19.0%). All other phyla combined accounted for an average of only 1.2% ± 0.7% of the gut microbiome. The proportions of these phyla varied widely across studies, likely because differences in geography or experimental methods (e.g., sample preparation and DNA extraction) among studies introduce substantial heterogeneity in gut microbial composition (Supplementary Fig. 2b)27,28. At the genus level, Phocaeicola (average relative abundance 15.6% ± 5.4%), Bacteroides (14.2% ± 5.6%), Prevotella (11.6% ± 7.5%), Faecalibacterium (5.8% ± 3.3%), Alistipes (3.6% ± 2.3%), and Roseburia (2.7% ± 1.1%) were the dominant genera (Fig. 1c); these genera were distributed in a similar fashion across all studies except WanY_2021. Enterotype analysis based on the genus profiles of all samples generated two enterotypes characterized by Bacteroides (75.7% of samples) and Prevotella (24.3% of samples) (Supplementary Fig. 3); this result differed from that of previous studies classifying the human gut microbiome into three enterotypes29,30, probably because the enterotype boundaries are indistinguishable in our large-scale datasets. At the species level, several members of Phocaeicola (including P. vulgatus, P. plebeius, and P. dorei), Bacteroides (including B. uniformis and B. stercoris), Prevotella copri, and Faecalibacterium prausnitzii were dominant across all analyzed samples (Fig. 1d).

The gut microbiome is associated with multiple diseases

To illustrate the alteration of the gut microbiome in disease, we conducted a comparative analysis of microbial diversity and compositional structure between cases and controls within each study. Analyses were performed for 40 case-control comparisons, as the disease groups of several studies were subdivided by subtypes (e.g., Crohn’s disease [CD] and ulcerative colitis [UC] for inflammatory bowel disease [IBD]) or severity (e.g., prehypertension and hypertension) (Fig. 1a). First, we found that 11 case-control comparisons showed a significant decrease in species richness (estimated by the number of observed species) in disease groups compared with their corresponding control groups, whereas only 1 case-control comparison showed a significant increase in richness and diversity (Wilcoxon rank-sum test, q < 0.05 after adjustment for sex, age, and BMI; Fig. 2a, b; Supplementary Table 2). Similarly, 12 case-control comparisons exhibited lower species diversity (estimated by the Shannon diversity index) in disease subjects compared to controls, whereas only 2 comparisons showed a higher level. The most prominent disease was Crohn’s disease, which was associated with decreases of over 10% in both species richness and diversity indexes across two comparisons (HeQ_2017 and WengY_2019.CD). In addition, patients with COVID-19 infection (YeohYK_2021), pulmonary tuberculosis (PT) (HuY_2019), hypertension (LiJ_2017.HTN and YanQ_2017), systemic lupus erythematosus (SLE) (ChenB_2020), liver cirrhosis (LC) (QinN_2014), gout (ChuY_2021), and Graves’ disease (GD) (ZhuQ_2021) also had decreases of over 10% in species richness and diversity, and one study of ankylosing spondylitis (AS) (ZhouC_2020) showed a decrease of over 20% in species richness and diversity. Conversely, increases in species richness and diversity were found in patients with Parkinson’s disease (PD) (QianY_2020 and MaoL_2021) and atrial fibrillation (AF) (ZuoK_2019).
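The two alpha-diversity measures named above can be illustrated with a minimal Python sketch. It assumes each sample is a list of species relative abundances; the function names, the detection threshold of zero, and the use of group means for the percent change are illustrative assumptions, not details taken from the study pipeline.

```python
import math

def observed_richness(abundances, detection_threshold=0.0):
    """Number of species whose abundance exceeds the (assumed) detection threshold."""
    return sum(1 for a in abundances if a > detection_threshold)

def shannon_index(abundances):
    """Shannon diversity H = -sum(p_i * ln(p_i)) over species with p_i > 0."""
    total = sum(abundances)
    if total <= 0:
        return 0.0
    h = 0.0
    for a in abundances:
        if a > 0:
            p = a / total  # renormalize so the proportions sum to 1
            h -= p * math.log(p)
    return h

def percent_change(case_values, control_values):
    """Percent change of the case mean relative to the control mean (assumed summary)."""
    case_mean = sum(case_values) / len(case_values)
    control_mean = sum(control_values) / len(control_values)
    return 100.0 * (case_mean - control_mean) / control_mean
```

Under this sketch, a percent change below -10 for both indexes would correspond to the "over 10% decrease" pattern described for Crohn’s disease.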

Fig. 2: Alterations of the gut microbiome across common diseases.

Bar plot showing the fold changes in gut microbial richness (a) and diversity (b), the disease-related effect size (c), and the within-study AUCs (d) of 40 case-control comparisons. Diseases are colored by disease types. For a and b, Wilcoxon rank-sum test: *p < 0.05; **p < 0.01; ***p < 0.001. For (c) adonis analysis with 1000 permutations: *p < 0.05; **p < 0.01; ***p < 0.001. For (d) the dashed line shows an AUC of 0.50, and the error bars show the 95% confidence intervals of the AUC values.

Permutational multivariate analysis of variance (PERMANOVA) of the gut microbial composition within each study showed that in 27 of 40 case-control comparisons, the disease state significantly impacted the overall compositional structure of the gut microbiome (adonis p < 0.05; Fig. 2c). Among these comparisons, those involving patients with CD (HeQ_2017 and WengY_2019.CD), polycystic ovary syndrome (QiX_2019), AF (ZuoK_2019), GD (ZhuQ_2021), SLE (ChenB_2020), LC (QinN_2014), PT (HuY_2019), and COVID-19 (YeohYK_2021) showed the greatest changes in the gut microbiome. We then trained random forest models to classify cases and controls within each comparison. The models achieved good discrimination, with an area under the receiver operating characteristic curve (AUC) > 0.7 in 28 of 40 case-control comparisons and an average AUC of 0.759 across all comparisons (Fig. 2d). Taken together, our results indicated profound changes in the gut microbiome in many different diseases.
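For readers unfamiliar with the AUC metric used here, the following minimal sketch computes it from its rank-based definition: the probability that a randomly chosen case receives a higher classifier score than a randomly chosen control, with ties counted as one half. The function name and the label coding (1 for case, 0 for control) are assumptions for illustration; the study itself does not specify how the AUC was computed.

```python
def roc_auc(scores, labels):
    """AUC as P(score of a random case > score of a random control); ties count 0.5."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("both cases (1) and controls (0) are required")
    wins = 0.0
    for c in cases:
        for h in controls:
            if c > h:
                wins += 1.0
            elif c == h:
                wins += 0.5
    return wins / (len(cases) * len(controls))
```

An AUC of 0.50 corresponds to chance-level discrimination, which is the dashed reference line in Fig. 2d.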

Shared gut microbial signatures across diseases

Previous studies have demonstrated that many diseases have some common gut microbiome signatures15. In our dataset, PERMANOVA across all samples revealed that disease status can significantly impact the overall gut microbiome, with a modest effect size (R² = 0.43%, adonis p < 0.001). As a comparison, the individuals’ sex, age, and BMI together explained only 0.26% (adonis p < 0.001) of the gut microbiome variation. These findings suggested the existence of shared gut species signatures for healthy individuals. Considering these findings, we next sought to identify the microbial signatures that are universal to diseases, using an algorithm that combines random-effects meta-analysis and phenotype-adjusted MaAsLin2 across all studies (see Methods). We identified 277 species that differed in relative abundance between diseased individuals and healthy controls (Supplementary Table 3). A total of 194 differentially abundant species were more abundant in the healthy controls than in the diseased subjects, while 83 species were enriched in the disease group.
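The pooling step of a random-effects meta-analysis can be sketched as follows. This example uses the DerSimonian-Laird estimator of between-study variance, which is a common choice but an assumption here: this section does not state which estimator was used. The inputs (one effect size and its variance per study for a given species) and the function name are likewise hypothetical.

```python
import math

def random_effects_meta(effects, variances):
    """Pool per-study effects with DerSimonian-Laird random-effects weights.

    Returns (pooled_effect, standard_error, tau_squared).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]            # inverse-variance (fixed-effect) weights
    sw = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sw
    # Cochran's Q measures heterogeneity around the fixed-effect estimate
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sw - sum(wi ** 2 for wi in w) / sw
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add the between-study variance to each study variance
    w_star = [1.0 / (v + tau2) for v in variances]
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    return pooled, se, tau2
```

When studies disagree strongly, the estimated between-study variance grows and the weights become more equal, so no single large study dominates the pooled estimate.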

Control-enriched and disease-enriched species were markedly separated in their taxonomic distribution at the phylum and genus levels (Fig. 3a). Most of the control-enriched species belonged to the phylum Firmicutes, including members of Clostridium (containing C. fessum, C. leptum, and a variety of unclassified species), Blautia (e.g., B. faecicola, B. glucerasea, B. massiliensis, and B. stercoris), Roseburia (e.g., R. intestinalis, R. faecis, R. hominis, and R. inulinivorans), Faecalibacterium (e.g., F. prausnitzii), and Ruminococcus (e.g., R. bicirculans, R. bromii, R. callidus, and R. lactaris). The remaining control-enriched species included several Bacteroidetes members, such as species of Bacteroides (e.g., B. cellulosilyticus, B. eggerthii, B. faecis, B. finegoldii, B. intestinalis, and B. uniformis) and Alistipes (e.g., A. inops, A. senegalensis, and A. shahii). The reductions in the abundances of many of these species were previously identified by the corresponding studies; these results were also reproduced in our analyses (Fig. 3b). Notably, members of Faecalibacterium, Roseburia, and Blautia are the most important producers of short-chain fatty acids (SCFAs) in the human gut31,32,33; this finding was in agreement with previous speculations, which suggested that reduced SCFA biosynthesis ability is a common characteristic for human diseases34.

Fig. 3: Gut microbial signatures across common diseases.

a Overall representation of the 277 universal gut microbial signatures. Colored stars represent the species enriched in cases (red) and controls (green). Only species with known genus-level taxa are shown. b Heatmap showing the fold changes in the abundances of some representative gut species within 40 case-control comparisons. Fold change < 0, enriched in cases; fold change > 0, enriched in controls. Wilcoxon rank-sum test: *q < 0.05; **q < 0.01; ***q < 0.001.

Conversely, the disease-enriched species were mainly opportunistic pathogens (Fig. 3a, b). For example, the overgrowth of Streptococcus (containing S. anginosus, S. constellatus, S. gallolyticus, S. gordonii, S. infantis, S. mutans, S. oralis, and S. parasanguinis), Enterocloster (e.g., E. aldensis, E. bolteae, E. citroniae, and E. clostridioformis), Escherichia coli, and Erysipelatoclostridium ramosum has previously been found in many diseases. Members of Fusobacterium (including F. mortiferum, F. nucleatum, F. pseudoperiodonticum, F. ulcerans, and F. varium) and Hungatella hathewayi are typical bacteria associated with colorectal cancer (CRC)35,36. Eggerthella lenta and Flavonifractor plautii were reported to be enriched in end-stage renal disease (ESRD) patients with uremic toxin-producing effects37, and have also been associated with several chronic diseases such as asthma38, multiple sclerosis39, and CRC40. Additionally, several members of Lactobacillus, including L. amylovorus, L. crispatus, L. gasseri, and L. mucosae, were enriched in disease subjects.

At the genus level, we identified 107 genera with significant differences in relative abundance between diseased and healthy subjects, with 73 being more abundant in controls and 34 in patients (Supplementary Fig. 4; Supplementary Table 4). Consistent with the species-level results, several genera involved in SCFA production, such as Faecalibacterium, Roseburia, and Butyricicoccus, were reduced in diseased subjects compared to healthy controls, whereas some harmful bacteria such as Fusobacterium, Escherichia, and Enterocloster were enriched in patients.

Disease characteristics with respect to universal gut species signatures

Considering the universal gut species signatures, we categorized the diseases into several groups based on the gross relative abundances of control-enriched (proposed as “healthy bacteria”) and disease-enriched species (proposed as “unhealthy bacteria”). First, in 13 of 40 case-control comparisons, the microbiomes of patients were characterized by a significant depletion of healthy bacteria accompanied by an expansion of unhealthy bacteria (Fig. 4). These comparisons involved 10 diseases, including breast cancer, CRC (YuJ_2017), hypertension (YanQ_2017), obesity (ZengQ_2021 and LiuR_2017), type 2 diabetes (T2D) (QinJ_2012), IBD (both CD and UC), GD, SLE, COVID-19, and LC. Second, 7 comparisons involving gout, hypertension (LiJ_2017), prehypertension, AS (ZhouC_2020), PT, autism spectrum disorder (ASD) (WanY_2021), and PD (QianY_2021) were characterized by an isolated depletion of healthy bacteria. For example, the gut microbiota of gout patients showed a marked decrease in the abundance of SCFA producers, but only a few harmful bacteria (e.g., Parabacteroides distasonis and Hungatella effluvii) were enriched. Third, 8 comparisons involving CRC (YangY_2021), AF, atherosclerotic cardiovascular disease (ACVD), irritable bowel syndrome (IBS), PCOS, rheumatoid arthritis (RA), and ESRD featured an isolated enrichment of unhealthy bacteria. An example is AF, which was associated with a serious overgrowth of Streptococcus and Enterococcus spp., although the abundance of beneficial bacteria did not decrease. Finally, for the remaining 12 case-control comparisons, no significant changes were observed in the abundances of either healthy or unhealthy bacteria. These comparisons covered almost all psychiatric and neurological diseases (i.e., ASD, schizophrenia, and PD), HIV infection, 3 immune diseases (Behcet’s disease, myasthenia gravis, and Vogt-Koyanagi-Harada disease), and 3 cardiometabolic diseases (carotid atherosclerosis, bone mass loss, and prediabetes/treatment-naive T2D).
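The four-way grouping described above can be expressed as a simple decision rule. The sketch below assumes that, for each case-control comparison, the change in gross relative abundance (cases minus controls) and a Wilcoxon p-value are already available for both the healthy and unhealthy species sets; the function name, argument names, and category labels are hypothetical, and the 0.05 threshold follows the significance level shown in Fig. 4.

```python
def shift_pattern(healthy_change, healthy_p, unhealthy_change, unhealthy_p, alpha=0.05):
    """Assign a case-control comparison to one of four dysbiosis patterns.

    *_change: gross relative abundance in cases minus controls (assumed convention).
    *_p: p-value of the corresponding case-control test.
    """
    healthy_depleted = healthy_p < alpha and healthy_change < 0
    unhealthy_expanded = unhealthy_p < alpha and unhealthy_change > 0
    if healthy_depleted and unhealthy_expanded:
        return "depletion of healthy and expansion of unhealthy bacteria"
    if healthy_depleted:
        return "isolated depletion of healthy bacteria"
    if unhealthy_expanded:
        return "isolated enrichment of unhealthy bacteria"
    return "no significant change"

# Counts reported in the text for the four groups; they cover all 40 comparisons.
reported_counts = [13, 7, 8, 12]
```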

Fig. 4: Comparison of control-enriched and disease-enriched species between diseased and control individuals.

Case-control comparisons with a significant difference in gross relative abundance between patients and controls are highlighted. Boxes represent the interquartile range between the first and third quartiles and the median (internal line). Whiskers denote the lowest and highest values within 1.5 times the range of the first and third quartiles, respectively; dots represent outlier samples beyond the whiskers. Wilcoxon rank-sum test: *p < 0.05; **p < 0.01; ***p < 0.001.

Prediction of disease status using gut microbial signatures

To investigate the potential of the universal gut microbial signatures in predicting disease status, we trained a random forest classifier based on the relative abundances of 277 differentially abundant species and tested its performance using a tenfold cross-validation approach. This classifier achieved an AUC of 0.776 (95% confidence interval [CI], 0.764–0.787) in classifying cases and controls across all investigated samples (Fig. 5a). In addition, it achieved an AUC of 0.825 (95% CI, 0.806–0.845) in distinguishing the patients with high-risk diseases from their corresponding control subjects, highlighting the high predictability of the statuses of these patients based on the gut microbial signatures. In terms of importance, some disease-enriched species, including Ligilactobacillus salivarius, Ruminococcus gnavus, Clostridium symbiosum, Streptococcus parasanguinis, Mediterraneibacter glycyrrhizinilyticus, Eggerthella lenta, Fusobacterium mortiferum, Blautia spp. (B. producta and B. hansenii), and Peptostreptococcus stomatis, featured the highest scores for the discrimination of patients and healthy controls (Fig. 5b). Random forest analysis at the genus level also generated an AUC of 0.787 (95% CI, 0.775–0.798) for classifying all patients from controls and 0.823 (95% CI, 0.803–0.843) for classifying high-risk patients from controls (Supplementary Fig. 5a). Similar classification results were observed within different disease subtypes (Supplementary Fig. 5b). Interestingly, reducing the set of signatures to the 80 most important species generated AUCs close to those obtained using all signatures for classifying cases versus controls (Supplementary Fig. 5c), suggesting that a minimal set of gut microbial signatures might be explored in the future. Additionally, classifiers based on the least absolute shrinkage and selection operator (LASSO) algorithm performed less well than the random forest models, with AUCs of 0.735 (95% CI, 0.723–0.747) and 0.705 (95% CI, 0.692–0.717) at the species and genus levels, respectively (Supplementary Fig. 5d).

Fig. 5: Prediction of health status using gut species signatures.

a Receiver operating characteristic (ROC) analysis of the classification of case/control status using the random forest model trained by 277 universal gut species signatures. The classification performance of the model was assessed by the area under the ROC curve (AUC). The AUC values and 95% confidence intervals (CIs) are shown. b The 20 most discriminant species-level signatures in the model classifying diseased and control individuals. The bar lengths indicate the importance of the variables. c Histogram showing the distribution of the GMHI in diseased and control individuals. d ROC analysis of the classification of case/control status in independent public cohorts using the random forest trained by the original cohorts. e ROC analysis of the classification of patients with autoimmune diseases and healthy controls. BD bipolar depression, CRC colorectal cancer, ESRD end-stage renal disease, RA rheumatoid arthritis, SLE systemic lupus erythematosus, SS primary Sjögren’s syndrome.

We generated a random forest-based gut microbiome health index (GMHI) for each metagenomic sample based on the predictive values averaged from ten random forest classifiers (see Methods). The GMHI theoretically ranged from 0 (extreme healthy state) to 1 (extreme disease state), with actual values of 0.655 ± 0.151 (mean ± S.D.) and 0.491 ± 0.141 for all disease and healthy individuals, respectively (Mann–Whitney U test, p < 2.2 × 10−16; Cliff’s delta = 0.571; Fig. 5c). A total of 81.0% (727/898) of metagenome samples with GMHIs < 0.4 were from the healthy group, and 80.6% (2,346/2,912) of metagenome samples with GMHIs > 0.6 were of nonhealthy origin. Moreover, we found moderate negative correlations between the GMHI and gut microbial diversity indexes (Spearman’s ρ = -0.37 for richness and ρ = -0.36 for diversity; Supplementary Fig. 6a). However, considering that the predictive power of microbial diversity was very poor (Supplementary Fig. 6b), we concluded that the GMHI is a more effective indicator of health.
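As an illustration of the quantities reported above, the following sketch averages per-classifier predictions into a single index and computes Cliff’s delta between two groups of index values. The function names are hypothetical, and the averaging function is only a schematic reading of "predictive values averaged from ten random forest classifiers"; the exact procedure is described in the Methods.

```python
def averaged_index(predicted_probabilities):
    """Average the disease probabilities returned by several classifiers for one sample."""
    return sum(predicted_probabilities) / len(predicted_probabilities)

def cliffs_delta(x, y):
    """Cliff's delta: P(x > y) - P(x < y) over all pairs; ranges from -1 to 1."""
    greater = sum(1 for a in x for b in y if a > b)
    less = sum(1 for a in x for b in y if a < b)
    return (greater - less) / (len(x) * len(y))

# Proportions reported in the text, recomputed from the stated counts.
healthy_share_below_04 = round(100 * 727 / 898, 1)      # share of healthy samples among GMHI < 0.4
diseased_share_above_06 = round(100 * 2346 / 2912, 1)   # share of nonhealthy samples among GMHI > 0.6
```

A Cliff’s delta of 0 means the two distributions overlap completely, whereas 1 means every diseased sample has a higher index than every healthy sample.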

Finally, to validate the reliability of our classifiers, we analyzed the fecal metagenomes from three independent public cohorts: (1) bipolar depression41, (2) CRC42, and (3) ESRD43. Another newly recruited cohort comprising 234 patients with autoimmune diseases (including 95 RA patients, 73 SLE patients, and 66 SS patients) and 118 healthy controls was also included for validation. Using these cohorts, we quantified the relative abundances of the 277 disease-associated species and compared them between cases and controls. The AUCs of the original random forest classifier on these new cohorts were 0.637 (95% CI, 0.533–0.741), 0.838 (95% CI, 0.736–0.940), and 0.836 (95% CI, 0.786–0.887) for BD, CRC, and ESRD versus controls, respectively (Fig. 5d), suggesting that the classifier may perform better for high-risk diseases like CRC and ESRD, whereas the discrimination for psychiatric diseases was less evident. In autoimmune diseases, the original random forest classifier achieved AUCs of 0.555 (95% CI, 0.459–0.651), 0.638 (95% CI, 0.545–0.732), and 0.717 (95% CI, 0.611–0.823) for RA, SLE, and SS, respectively (Fig. 5e). Consistent with previous results, the low discrimination performance for RA may be related to the relatively smaller changes in the gut microbiome. Overall, these findings suggested that the generalized disease-associated gut microbial signatures identified in this study can accurately classify multiple disease states from a healthy state.

Discussion

The human gut harbors a vast array of microbes that significantly influence host health and disease status. Gut dysbiosis not only affects gastrointestinal diseases (e.g., diarrhea and IBS) but also plays a role in immune-related diseases (e.g., IBD, RA, multiple sclerosis, and allergies), central nervous system conditions (Alzheimer’s disease, autism, and PD), and disorders in host energy metabolism (e.g., obesity, atherosclerosis, and T2D)44,45,46,47,48. In this study, we employed a unified meta-analysis pipeline, starting from raw metagenomic reads of 6,314 human fecal samples from the Chinese population, to conduct a comprehensive analysis of the gut microbiome characteristics of these samples. We found that gut species richness and diversity were significantly decreased in patients with various diseases, corroborating previous studies that suggest decreased microbial diversity may be linked to poor health status in humans49,50,51. Our analysis revealed that the overall microbiome structures are markedly altered in most of the investigated diseases; in particular, the patients with several disease statuses, including IBD, PCOS, COVID-19 infection, SLE, and LC, experienced the most prominent alteration in their gut microbiota. Profound variation of the gut microbiota was also found in previous disease-specific studies52,53,54,55,56. Typically, it is not clear whether the changes in the microbiota are a cause or a consequence of these diseases; the characteristics and causes of microbial flora changes in various diseases are thus worth further exploration.

By pooling massive amounts of metagenomic samples and performing cross-disease meta-analysis and statistical analyses, we identified a set of gut microbial signatures that appear to be universal across diseases. These signatures involved the enrichment of some opportunistic pathogens that are commonly associated with different diseases (e.g., Erysipelatoclostridium ramosum, Ruminococcus gnavus, Peptostreptococcus stomatis, Streptococcus spp., Fusobacterium spp., and Enterocloster spp.) and the depletion of beneficial SCFA producers (e.g., Faecalibacterium, Roseburia, and Blautia spp.). This result is consistent with recent studies indicating that some shared features of the gut microbiota are associated with common health markers15,57. Furthermore, we developed a random forest classifier based on the microbial signatures and validated the reliability of the classifier in the investigated datasets, external public datasets, and a newly recruited independent cohort of 3 autoimmune diseases. The overall performance of this classifier (optimal AUC, 0.776) was comparable with recent cross-population species-level microbiome studies and higher than the models trained by microbial functional signatures17. Therefore, although technical challenges currently limit the immediate clinical application of gut microbiota-based disease-health classifiers, we suggest that subsequent exploration can lead to their use as indicators of overall health evolution or facilitate targeting gut microbe-based intervention.

Notably, as the fecal metagenomes were collected from 36 different studies, we did not account for all interstudy heterogeneities (e.g., sample storage, DNA extraction and sequencing) among the investigated studies. Additionally, due to the lack of metadata, we were unable to adjust for the influences of confounding factors (e.g., age, BMI, and medication usage of patients) on the gut microbiome when performing within-study comparisons between cases and controls. To minimize these limitations, we (1) used strict criteria for the inclusion and exclusion of studies and samples, (2) reanalyzed all datasets based on a comprehensive database and used a unified pipeline, and (3) used meta-analysis to reduce the possible instability from a single study. These efforts partially accounted for the inconsistencies among different studies and samples, and they also contributed to the development of a standardized methodology for metagenome-based cross-disease analysis of the gut microbiome. Overall, our findings based on a large dataset highlighted that data-sharing efforts can provide broadly applicable findings for gut microbiome studies, demonstrated the strong associations between gut microbial diversity and structure and common human diseases, and provided new materials and prospects for future research.

Methods

Collection of public datasets and description of diseases

We searched for published gut microbiome studies in the PubMed and Google Scholar databases based on exhaustive keywords such as “gut metagenome”, “gut microbiota/microbiome”, and “shotgun sequencing” (as of January 2022). The materials of each study were manually reviewed, and 44 studies were included under the following criteria: (1) a case-control study of disease, (2) samples from Chinese individuals, and (3) availability of fecal metagenomic data. Eight of these studies were removed because they included <50 total samples or the case and control samples were not from the same batch. The raw whole-metagenome shotgun sequencing datasets of the remaining studies were downloaded from the National Center for Biotechnology Information—Sequence Read Archive (NCBI SRA), European Nucleotide Archive (ENA), and China National GeneBank (CNGB) databases. Metadata of the studies was obtained from the original articles and materials, from the NCBI/EBI/CNGB sample information, or by contacting the corresponding authors. Within each study with available metadata, samples were excluded under the following criteria: (1) nonstandard definitions of disease or unhealthy statuses (e.g., hypertension/diabetes were defined according to the latest standards), (2) no baseline samples for longitudinal sampling, and (3) BMI < 17 kg/m2 or >30 kg/m2 (for samples with available phenotypic data, except in the obesity and diabetes studies).
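The numeric exclusion rules stated above can be written as two small predicates. This is a minimal sketch: the function and parameter names are hypothetical, samples without BMI data are assumed to be retained (the text applies the BMI rule only to samples with available phenotypic data), and the non-numeric criteria (disease definitions, baseline sampling) are not modeled.

```python
def keep_study(total_samples, same_batch, min_samples=50):
    """Study-level rule: drop studies with <50 samples or cases/controls from different batches."""
    return total_samples >= min_samples and same_batch

def keep_sample(bmi=None, bmi_exempt_study=False, bmi_min=17.0, bmi_max=30.0):
    """Sample-level BMI rule: exclude BMI < 17 or > 30 kg/m2 when BMI is known.

    bmi_exempt_study marks obesity and diabetes studies, which are exempt from the rule.
    """
    if bmi is None or bmi_exempt_study:
        return True
    return bmi_min <= bmi <= bmi_max
```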

Thirty-six studies, covering 28 different diseases or unhealthy conditions, were included in this study. The most common disease types were autoimmune diseases, which encompassed 8 diseases from 9 studies (including AS, Behcet’s disease, GD, gout, pediatric myasthenia gravis, RA, SLE, and Vogt-Koyanagi-Harada disease), and cardiometabolic diseases, which comprised 7 diseases from 10 studies (including ACVD, AF, bone mass loss, carotid atherosclerosis, hypertension, obesity, and T2D). Additionally, our analyses included infectious diseases (COVID-19 infection, HIV infection, and PT), digestive diseases (CD, UC, and IBS), psychiatric diseases (ASD and schizophrenia), and cancers (breast cancer and CRC). The remaining diseases belonged to distinct types, including polycystic ovarian syndrome (endocrine disease), ESRD (kidney disease), liver cirrhosis (liver disease), and PD (neurological disease).

Recruitment and metagenomic sequencing of the autoimmune cohort

This study was approved by the ethics committee of Dalian Medical University, and an informed consent agreement was obtained from each participant. Participants were recruited from the Second Affiliated Hospital of Dalian Medical University and the Second Affiliated Hospital of Guizhou University of Traditional Chinese Medicine. Patients with autoimmune diseases were included based on a confirmed diagnosis by a licensed physician, following the 2019 European League Against Rheumatism/American College of Rheumatology (EULAR/ACR) classification criteria58 for RA, SLE, and pSS. Exclusion criteria for participants included the presence of diabetes, severe hypertension, severe obesity or metabolic syndrome, IBD, cancers, and abnormal liver or kidney function. Individuals who had taken antibiotics or probiotic products within the past 4 weeks were also excluded. Based on these criteria, a total of 95 RA patients, 73 SLE patients, 66 pSS patients, and 118 healthy controls were included for further analysis.

Fecal specimens were collected from participants, temporarily stored on dry ice, transported to the laboratory within 24 h, and stored at -80°C for further analyses. DNA was extracted from fecal samples using the TIANamp Stool DNA Kit (TIANGEN, China), and its quality was assessed using the Qubit 2.0. The extracted DNA samples were stored at -80°C until use. The sequencing library was prepared using the NEB Next Ultra DNA Library Prep Kit (NEB, USA) according to the manufacturer’s instructions, with index codes added to each sample. Library quality was confirmed using an Agilent 2100 Bioanalyzer. Clustering of the index-coded samples was performed on a cBot Cluster Generation System using an Illumina PE Cluster Kit (Illumina, USA) as per the manufacturer’s instructions. Following cluster generation, the DNA libraries were sequenced on the Illumina NovaSeq platform, generating 150 bp paired-end reads. Quality control and removal of human contaminants were carried out using the same pipeline as for the publicly available samples.

Gut microbiome profiling of fecal metagenomes

For all samples, the raw metagenomic reads were processed for quality control using fastp59. Low-quality (>45 bases with a quality score < 20 or > 5 ‘N’ bases), low-complexity, and adapter-containing reads were removed. The remaining reads were trimmed at the tails for low quality (<Q20) or ‘N’ bases, and trimmed reads shorter than 45 bases were also removed. Contaminating human sequences were eliminated by mapping against the reference human genome (GRCh38) using Bowtie260. Samples with <10 million high-quality nonhuman reads were removed. Finally, 36 studies spanning a total of 6314 fecal metagenomes were retained for follow-up analysis. Taxonomic profiling of the gut microbiome of all samples was performed using MetaPhlAn 426. The tool includes a large number of known and uncharacterized species in the human microbiome and clusters them into species-level genome bins (SGBs) for analysis. Its reference database contains ~22,000 known species and 5000 uncharacterized species.
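To make the read-level thresholds concrete, the sketch below applies them to a single read. It is a toy re-statement of the stated cut-offs, not fastp's internal algorithm. The function name is hypothetical, the low-complexity and adapter filters are omitted, and "trimmed at the tails" is assumed here to mean trimming from the 3' end only.

```python
from typing import List, Optional, Tuple

MIN_QUALITY = 20         # bases below Q20 count as low quality
MAX_LOW_Q_BASES = 45     # reads with more than 45 low-quality bases are discarded
MAX_N_BASES = 5          # reads with more than 5 'N' bases are discarded
MIN_LENGTH = 45          # trimmed reads shorter than 45 bases are discarded


def filter_and_trim(seq: str, quals: List[int]) -> Optional[Tuple[str, List[int]]]:
    """Return the trimmed (sequence, qualities), or None if the read is dropped."""
    low_q = sum(1 for q in quals if q < MIN_QUALITY)
    if low_q > MAX_LOW_Q_BASES or seq.upper().count("N") > MAX_N_BASES:
        return None

    # Assumption: trim low-quality or 'N' bases from the 3' tail only.
    end = len(seq)
    while end > 0 and (quals[end - 1] < MIN_QUALITY or seq[end - 1] in "Nn"):
        end -= 1

    if end < MIN_LENGTH:
        return None
    return seq[:end], quals[:end]
```

A 100-base read whose last 10 bases fall below Q20 would be kept and shortened to 90 bases, while a 60-base read whose last 20 bases fall below Q20 would be dropped because only 40 bases remain after trimming.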

Statistical analysis

Statistical analyses were performed in R (version 4.0.1).

Comparison analyses

For the case-control comparison within each dataset, P-values were calculated using the Wilcoxon rank-sum test. The q value was used to control the false discovery rate (FDR) under multiple comparisons and was calculated with the R p.adjust function using the Benjamini-Hochberg method61. Random effects meta-analysis was applied to the gut microbiome datasets following a previously developed algorithm35. We combined this meta-analysis with the multivariate analysis by linear models (MaAsLin 2) algorithm62 to identify disease-associated differential signatures. For the meta-analysis, taxonomic relative abundances were arcsine-square root transformed, and the escalc function in the R metafor package was used to compute Hedges’ g standardized mean differences, from which the pooled effect size was estimated with a random effects model. Linear mixed effects model analysis was conducted using the lmer function of the lmerTest package63, with study as a random effect, and the significance of the models was calculated using the anova function. MaAsLin 2 allowed us to capture the associations between diseases and microbial taxa while adjusting for the effects of other demographic parameters. A taxon was defined as differentially abundant if it met both of the following criteria: (1) random effects meta-analysis q < 0.05, and (2) MaAsLin 2 analysis q < 0.05 after accounting for age, sex, and BMI (considering only datasets with available metadata) as confounding factors.
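The core quantities in this paragraph can be illustrated with a short, dependency-free sketch: Benjamini-Hochberg q values, a per-study Hedges' g on arcsine-square root-transformed abundances, and a random effects pooled estimate. Several choices here are assumptions rather than the authors' settings: the function names are hypothetical, the small-sample correction uses the common approximation 1 - 3/(4df - 1), and between-study variance is estimated with the DerSimonian-Laird formula because it has a closed form (the estimator used in the original analysis is not stated).

```python
import math
from statistics import mean, variance
from typing import List, Sequence, Tuple


def bh_adjust(pvals: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg q values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i], reverse=True)
    qvals = [0.0] * m
    running_min = 1.0
    for offset, i in enumerate(order):
        rank = m - offset  # rank of this P-value in ascending order
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals


def hedges_g(cases: Sequence[float], controls: Sequence[float]) -> Tuple[float, float]:
    """Hedges' g (cases minus controls) and its approximate sampling variance."""
    x = [math.asin(math.sqrt(p)) for p in cases]     # arcsine-square root transform
    y = [math.asin(math.sqrt(p)) for p in controls]
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / df)
    d = (mean(x) - mean(y)) / pooled_sd               # undefined if pooled_sd == 0
    g = (1 - 3 / (4 * df - 1)) * d                    # approximate bias correction
    var_g = 1 / n1 + 1 / n2 + g * g / (2 * (n1 + n2))
    return g, var_g


def random_effects_pool(effects: Sequence[float],
                        variances: Sequence[float]) -> Tuple[float, float, float]:
    """Pooled effect, its standard error, and tau^2 (DerSimonian-Laird)."""
    w = [1 / v for v in variances]
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q_stat = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    tau2 = max(0.0, (q_stat - (len(effects) - 1)) / c) if c > 0 else 0.0
    w_star = [1 / (v + tau2) for v in variances]      # random effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    return pooled, math.sqrt(1 / sum(w_star)), tau2
```

In use, `hedges_g` would be called once per study for a given taxon, the resulting effects and variances passed to `random_effects_pool`, and the per-taxon P-values then corrected with `bh_adjust`.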

Multivariate analyses

Clustering analysis of the gut microbiomes was performed based on the species-level relative abundance profiles using the Jensen-Shannon divergence and the partitioning around medoids (PAM) clustering algorithm, following the tutorial of gut microbiome enterotype analysis30. The average silhouette width (ASW) was used to determine the optimal number of clusters, and an ASW of <0.3 suggested poor support for the existence of discrete clusters. PERMANOVA was performed with the R vegan package64, for which the effect size (R²) of disease statuses on microbiome variation was calculated using the adonis function, and the P-value was generated based on 1000 permutations. Principal coordinate analysis (PCoA) and distance-based redundancy analysis (dbRDA) were performed with the R vegan package based on the Bray-Curtis dissimilarity. The Spearman correlation coefficient and its significance were assessed using the cor.test function.
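As an illustration of how cluster support is judged, the sketch below computes a Jensen-Shannon distance between two abundance profiles and the average silhouette width for a given cluster assignment. It is a simplified stand-in, not the tutorial's R code: the function names are hypothetical, zeros are replaced by a small pseudocount (an assumption about zero handling), the distance is taken as the square root of the divergence, and the PAM step that produces the cluster labels is left out.

```python
import math
from statistics import mean
from typing import List, Sequence


def jsd_distance(p: Sequence[float], q: Sequence[float],
                 pseudocount: float = 1e-6) -> float:
    """Square root of the Jensen-Shannon divergence between two profiles."""
    p = [max(v, pseudocount) for v in p]   # avoid log(0) for absent taxa
    q = [max(v, pseudocount) for v in q]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kld(x: Sequence[float], y: Sequence[float]) -> float:
        return sum(a * math.log(a / b) for a, b in zip(x, y))

    return math.sqrt(max(0.0, 0.5 * kld(p, m) + 0.5 * kld(q, m)))


def average_silhouette_width(dist: List[List[float]], labels: Sequence[int]) -> float:
    """Mean silhouette over samples; requires at least two clusters."""
    n = len(labels)
    clusters = sorted(set(labels))
    widths = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)             # usual convention for singletons
            continue
        a = mean([dist[i][j] for j in own])
        b = min(
            mean([dist[i][j] for j in range(n) if labels[j] == c])
            for c in clusters if c != labels[i]
        )
        widths.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return mean(widths)
```

Given a full pairwise distance matrix, one would compute the ASW for each candidate number of clusters and compare the best value against the 0.3 threshold mentioned above.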

Prediction analyses

Random forest analysis was performed with the R randomForest package (1000 trees), and the performance of the random forest models was assessed with tenfold cross-validation. LASSO analysis was performed following the methodology described in a previous study42. ROC analysis was implemented with the R pROC package, and the AUC was calculated accordingly. The random forest-based GMHI for each metagenomic sample was calculated as the average of the predictive values from ten random forest classifiers.
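The evaluation scaffolding around the classifier can be sketched without any modelling library. The example below shows one way to assign samples to ten folds, compute an AUC from predicted scores, and average per-sample predictions across several classifiers. The function names are hypothetical, the fold assignment is a plain shuffled split (whether the original analysis stratified by class is not stated), and the random forest training itself is deliberately left out.

```python
import random
from typing import List, Sequence


def kfold_indices(n_samples: int, k: int = 10, seed: int = 0) -> List[List[int]]:
    """Shuffle sample indices and deal them into k near-equal test folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a case outscores a control (ties count 0.5)."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one case and one control")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_predictions(per_model_scores: Sequence[Sequence[float]]) -> List[float]:
    """Per-sample mean of the predicted values from several classifiers."""
    n_models = len(per_model_scores)
    return [sum(col) / n_models for col in zip(*per_model_scores)]
```

In a full workflow, each fold from `kfold_indices` would serve once as the held-out set, the held-out predictions would be scored with `roc_auc`, and an index such as the GMHI would be obtained by passing the outputs of the ten trained classifiers to `average_predictions`.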