Introduction

Recent landmark studies have unveiled profound links between the gut microbiome and a variety of complex, chronic diseases1,2,3,4,5,6,7,8,9. Despite these discoveries, two practical questions remain open: how can we tell whether a person has dysbiosis, and how can we harness unique microbial signatures to quantitatively track health? These questions stand at the forefront of utilizing the gut microbiome as a precise marker for health and wellness.

The potential of the gut microbiome as a marker for deciphering complex, chronic diseases has captivated the scientific community—in response, we recently developed the Gut Microbiome Wellness Index (GMWI) [previously called the Gut Microbiome Health Index (GMHI)]10. GMWI is a first-of-its-kind stool metagenome-based indicator for assessing health by determining the likelihood of an individual harboring a clinically diagnosed disease solely from their gut microbiome composition, irrespective of the specific disease type10,11. This disease-agnostic index was derived from a comprehensive analysis of a pooled dataset comprising 4347 stool shotgun metagenomes from 34 independent studies. GMWI is a logarithmic ratio of the collective abundances—a term encompassing species-level relative abundances and multiple α-diversity metrics—of health- and disease-associated gut microbial species. Evaluating on the pooled dataset, GMWI exhibited a balanced accuracy (i.e., average of the proportions of healthy and non-healthy samples that were correctly classified) of 69.7% in predicting the presence of clinically diagnosed disease. Specifically, the correct classification rates for healthy (disease-free) individuals and those with non-healthy (diseased) conditions were 75.6% and 63.8%, respectively. Moreover, GMWI achieved a balanced accuracy of 73.7% in a validation cohort of 679 stool metagenomes, with the correct classification rates for the healthy and non-healthy subsets being 77.1% (91 out of 118) and 70.2% (394 out of 561), respectively. Since its original publication in 2020, GMWI has been utilized in studies investigating the impact of environmental12 and genetic/socioeconomic13 factors on the human gut microbiome, as well as in identifying a ‘Longevous Gut Microbiota Signature’ species set13.
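The balanced-accuracy definition used above can be checked with a short calculation. The following is a minimal sketch in Python; the function name `balanced_accuracy` is illustrative and not taken from the GMWI software. It uses only the validation-cohort counts quoted in the paragraph above.

```python
def balanced_accuracy(healthy_correct, healthy_total,
                      nonhealthy_correct, nonhealthy_total):
    """Mean of the per-group correct classification rates."""
    healthy_rate = healthy_correct / healthy_total
    nonhealthy_rate = nonhealthy_correct / nonhealthy_total
    return (healthy_rate + nonhealthy_rate) / 2


# Validation-cohort counts quoted above: 91 of 118 healthy and
# 394 of 561 non-healthy samples were correctly classified.
validation_ba = balanced_accuracy(91, 118, 394, 561)
print(f"{100 * validation_ba:.1f}%")  # prints 73.7%
```

Because each group contributes equally to the average, the metric is not dominated by the larger non-healthy subset of the validation cohort.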

Despite the promise of our original GMWI prototype, there are limitations that impede its general applicability. Firstly, GMWI correctly classifies healthy stool metagenomes at a higher success rate than non-healthy ones. This bias may stem from the prevalence-based strategy used to identify health-associated and disease-associated species, which was a fundamental component of the GMWI model. As the non-healthy group encompasses patients with different diseases, this group is inherently heterogeneous; in turn, a prevalence-based strategy may miss subtle taxonomic signatures that are only represented in subsets of non-healthy populations (e.g., cohorts with a specific disease). Secondly, our existing model assigns equal weight to each species without considering potential variances in the importance of individual species. To improve classification accuracy and general applicability, a refined weighting system that accounts for varying strengths of association to host phenotype is needed. Additionally, including gut microbial information from all taxonomic ranks could uncover more features that accurately predict host phenotypes14,15. In this study, we present GMWI2, an advanced iteration of the original GMWI that addresses the above limitations and significantly improves classification accuracy in distinguishing between healthy and non-healthy phenotypes.

Results

Pooled analysis of stool metagenomes across health and disease phenotypes

As in our previous work10, we define “healthy” subjects as those without reported diseases or abnormal body weight conditions (i.e., classified as underweight, overweight, or obese based on reported BMI), whereas “non-healthy” subjects are those confirmed to have a clinical diagnosis of any disease. (Retaining the same definitions for “healthy” and “non-healthy” ensures that the current work represents a continuous refinement of our original GMWI method.) We conducted a pooled analysis of 8069 existing stool shotgun metagenomes (5547 from healthy individuals and 2522 from non-healthy individuals) sourced from 54 independently published studies spanning 26 countries and six continents (Fig. 1a, Table 1, and Supplementary Data 1). These pooled metagenomes are from individuals with one of twelve different health and disease phenotypes (Fig. 1a; healthy, ankylosing spondylitis, atherosclerotic cardiovascular disease, colorectal cancer, Crohn’s disease, Graves’ disease, liver cirrhosis, multiple sclerosis, nonalcoholic fatty liver disease (also known as metabolic dysfunction-associated steatotic liver disease [MASLD]), rheumatoid arthritis, type 2 diabetes, and ulcerative colitis) and represent diverse geographies, ethnicities/races, and cultures, with balanced sex representation (Fig. 1b). (Our study and sample selection criteria can be found in the “Methods” section. We provide all subjects’ phenotype, age, sex, BMI, and geography [as provided in their respective original study] in Supplementary Data 2.) This substantial increase in sample size, nearly doubling the number of metagenomes included in our previous study, is one notable improvement in GMWI2. Additionally, GMWI2 uses MetaPhlAn316 instead of MetaPhlAn217 for taxonomic profiling, leveraging an extensively expanded marker database for a more comprehensive and accurate characterization of microbial taxa (“Methods” section).

Fig. 1: Conducting a pooled analysis of stool metagenomes across multiple health and disease conditions from a diverse global representation.

a A survey was conducted in PubMed and Google Scholar to search for published studies with publicly available human stool shotgun metagenome (gut microbiome) samples from healthy (disease-free) and non-healthy (diseased) individuals. The initial collection of stool metagenomes consisted of 12957 samples from 73 independent studies. All raw metagenome samples (.fastq files) were downloaded and reprocessed uniformly using identical bioinformatics methods. After quality control of sequenced reads, taxonomic profiling was performed using MetaPhlAn3. Studies and samples were removed based on several exclusion criteria. Finally, a total of 8069 samples (5547 and 2522 metagenomes from healthy and non-healthy individuals, respectively) from 54 studies ranging across healthy and 11 non-healthy phenotypes were assembled into a pooled metagenome dataset for downstream analyses. b Demographic summary of the study subjects whose metagenome samples were included in the pooled dataset. Subject demographics, as reported in the original studies, include country of origin (n = 8069), age (n = 4670), and sex (n = 5247).

Table 1 Human stool shotgun metagenome datasets used in this study

All metagenomes underwent uniform reprocessing using an identical bioinformatics pipeline, as described in the “Methods” section. Such practice not only mitigates batch effects18,19, but also bolsters the identification of health- and disease-related gut taxonomic signatures despite the presence of potentially strong confounding factors. Indeed, this is supported by principal component analysis (PCA), where, despite the samples originating from varying sources and conditions, the healthy and non-healthy groups display significantly distinct gut microbiome profiles (Adonis R² = 1.2%, P = 0.001, PERMANOVA; Fig. 2a). Nevertheless, although the consensus preprocessing of metagenomic data effectively reduces one source of batch effects related to bioinformatics analyses, it is important to recognize that this approach cannot entirely eliminate potential batch effects arising from experimental and technical procedures across different studies. Such factors include differences in how stool samples were collected, stored, and prepared for metagenomic sequencing.

Fig. 2: Gut microbiome taxonomic profiles of healthy and non-healthy individuals inform a Lasso-penalized logistic regression classification model.

a Principal component analysis (PCA) of gut microbiome profiles. Significant differences in distributions between healthy (disease-free) (blue, n = 5547) and non-healthy (diseased) (red, n = 2522) groups were observed (P < 0.05, PERMANOVA). Ellipses represent 95% confidence regions. The loading vectors with the top 10 highest PC1 and PC2 magnitudes are shown. b Coefficient values for the Lasso-penalized logistic regression model. The model includes 49 taxa with positive coefficients, 3105 taxa with zero coefficients, and 46 taxa with negative coefficients.

Implementing Lasso-penalized logistic regression in GMWI2

For the classification task of distinguishing between healthy and non-healthy groups, GMWI2 uses a Lasso-penalized logistic regression model instead of the log-ratio equation utilized in the original GMWI. Hence, GMWI2 essentially uses a linear model (on the log-odds scale) for its predictions, resembling polygenic risk score models in statistical genetics20,21. The model was trained on gut microbiome taxonomic profiles (derived from the aforementioned pooled dataset of 8069 stool shotgun metagenomes) spanning all measurable taxonomic ranks to model the log odds of disease status as a linear function of microbial taxon (i.e., clade) presence or absence. Specifically, the GMWI2 score for an individual sample is defined as the predicted log odds (logit) of the sample originating from a healthy, non-diseased individual. A more comprehensive explanation of how GMWI2 uses Lasso-penalized logistic regression to estimate disease likelihood is provided in the “Methods” section.
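To make the score definition concrete, the sketch below computes a log-odds score from taxon presence or absence and converts it to a probability. It is an illustration under stated assumptions, not the published implementation: the taxon names and coefficient values are hypothetical, and the intercept is assumed to be zero because the text does not state one.

```python
import math


def log_odds_score(present_taxa, coefficients, intercept=0.0):
    """Predicted log odds of 'healthy' for one sample.

    present_taxa : set of taxon names detected in the sample
    coefficients : dict mapping taxon name -> learned coefficient
    intercept    : assumed to be 0.0 here (not specified in the text)
    """
    return intercept + sum(
        weight for taxon, weight in coefficients.items() if taxon in present_taxa
    )


def probability_healthy(score):
    """Convert log odds to a probability with the logistic function."""
    return 1.0 / (1.0 + math.exp(-score))


# Hypothetical coefficients for illustration only (not the published values).
toy_coefficients = {"taxon_A": 0.54, "taxon_B": 0.20, "taxon_C": -0.68}
sample_taxa = {"taxon_A", "taxon_C", "taxon_X"}  # taxon_X has no coefficient

score = log_odds_score(sample_taxa, toy_coefficients)
print(round(score, 2), round(probability_healthy(score), 3))
```

A score of zero corresponds to a probability of exactly 0.5, so the sign of the score carries the same information as comparing the probability against one half.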

The original GMWI approach utilized a prevalence-based strategy to identify health- and disease-associated microbial species. Our current method learns variable feature importances, obviating the need for manual species identification. More specifically, the Lasso-penalized logistic regression model utilized 95 microbial taxa with non-zero coefficients for its predictions, derived directly from the gut microbiome profiles (Fig. 2b and Supplementary Data 3). Interestingly, the majority of taxa characterized by positive and negative coefficients exhibited a higher relative abundance in the healthy and non-healthy groups, respectively (Supplementary Data 4). These identified taxa included 1 class, 3 orders, 4 families, 19 genera, and 68 species. Notably, the coefficient values varied between –0.68 and 0.54, ensuring that each taxon contributes differently to the GMWI2 score according to its relative association strength. This presents a shift from our previous GMWI log-ratio model where equal weight was assigned to each species.
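The way an L1 (Lasso) penalty produces exact zeros for uninformative features while still assigning graded weights to informative ones can be illustrated with a small proximal gradient sketch. This is a toy implementation written for this illustration: it omits the intercept, uses a hypothetical four-sample presence/absence dataset, and is not the solver or the hyperparameters used for GMWI2.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_threshold(value, threshold):
    """Proximal operator of the L1 penalty: shrink toward zero, clip at zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_lasso_logistic(X, y, penalty, learning_rate=0.5, n_iter=500):
    """Proximal gradient descent for L1-penalized logistic regression (no intercept)."""
    n_samples, n_features = len(X), len(X[0])
    weights = [0.0] * n_features
    for _ in range(n_iter):
        # Gradient of the mean logistic loss at the current weights.
        gradient = [0.0] * n_features
        for row, label in zip(X, y):
            error = sigmoid(sum(w * x for w, x in zip(weights, row))) - label
            for j, x in enumerate(row):
                gradient[j] += error * x / n_samples
        # Gradient step followed by soft-thresholding (the L1 proximal step).
        weights = [
            soft_threshold(w - learning_rate * g, learning_rate * penalty)
            for w, g in zip(weights, gradient)
        ]
    return weights


# Hypothetical presence/absence data: feature 0 tracks the label, feature 1 does not.
X = [[1, 1], [1, 0], [0, 1], [0, 0]]
y = [1, 1, 0, 0]  # 1 = healthy, 0 = non-healthy
weights = fit_lasso_logistic(X, y, penalty=0.15)
print(weights)
```

In this toy case the uninformative feature keeps a coefficient of exactly zero, while the informative one settles at a positive value, mirroring the sparse, unequal weighting described above.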

It is worth mentioning that several taxonomic levels exhibited non-zero coefficients in our analysis. This is likely due in part to the interdependence across different levels of taxonomic hierarchy introducing multicollinearity, which complicates the interpretation of regression coefficients. However, our approach in encompassing all taxonomic levels demonstrated higher classification performance compared to when using only a single taxonomic level (Supplementary Table 1). Given our primary objective of optimizing classification accuracy, we chose to prioritize this aspect, leading us to set aside the multicollinearity concern.

In the following sections, we evaluate GMWI2’s proficiency in differentiating healthy from non-healthy individuals. This process can be conceptually structured into four phases:

  1. Model training: GMWI2 is trained and evaluated on the full training dataset. This phase utilizes all 8069 samples for computing the logistic regression coefficients (as depicted in Fig. 2b) and determining GMWI2 scores.

  2. Cross-validation: GMWI2 undergoes further evaluation through cross-validation (CV) and inter-study validation (ISV) strategies. In contrast to the initial phase, these strategies do not leverage all 8069 samples simultaneously for model training. As a result, the models generated during this phase are intrinsically different from those produced in the first phase. In line with standard cross-validation protocols, the training of the GMWI2 model, including the computation of logistic regression coefficients, is confined strictly to the training partition of each train-test split of the total 8069 samples.

  3. Validation on external datasets: The GMWI2 model developed in the first phase is applied to six external datasets to confirm its discriminatory power on independent samples.

  4. Demonstration on longitudinal datasets: The GMWI2 model from the first phase is applied to four additional external datasets. These evaluations focus on demonstrating GMWI2’s applicability in longitudinal scenarios.

Enhanced classification of healthy and non-healthy gut microbiomes with GMWI2

GMWI2 scores were calculated for metagenomes by applying the learned coefficients in computing the predicted log odds. A positive GMWI2 value classifies the sample as healthy, indicating disease absence, whereas a negative GMWI2 value classifies it as non-healthy, denoting disease presence. A GMWI2 of 0 implies an equal weighted presence of positive coefficient taxa and negative coefficient taxa, thereby classifying the sample as neither healthy nor non-healthy. When evaluated on the training dataset (8069 samples), GMWI2 demonstrated a balanced accuracy of 79.9% (correct classification rate in healthy: 79.2%, correct classification rate in non-healthy: 80.6%) and a Cliff’s Delta (d) effect size of 0.75, significantly surpassing the balanced accuracy and Cliff’s Delta reported by our original GMWI model (71.8%, d = 0.63) and traditional species-level α-diversity indices (i.e., Shannon Index, Simpson Index, and richness) (Fig. 3a and Supplementary Data 5). Our results indicate that GMWI2 differentiates between healthy and non-healthy groups much more effectively than GMWI, although both indices were strongly correlated (Pearson’s r = 0.81; Supplementary Fig. 1). Moreover, we found that the gut microbiomes of healthy individuals exhibit significantly higher GMWI2 scores compared to each of the eleven disease phenotypes (Fig. 3b). Lastly, we observed only weak correlations between GMWI2 and clinical/demographic characteristics (|Spearman’s ρ| < 0.3; Supplementary Figs. 2a–g), such as age, BMI, fasting blood glucose, blood cholesterol, and triglycerides, suggesting that these factors do not strongly influence gut microbiome-based classification outcomes.
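For readers unfamiliar with the effect size used here, Cliff’s Delta is the probability that a score from one group exceeds a score from the other, minus the probability of the reverse. The sketch below is a direct pairwise implementation; the scores are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def cliffs_delta(group_a, group_b):
    """Cliff's Delta: P(a > b) - P(a < b) over all cross-group pairs."""
    greater = sum(1 for a in group_a for b in group_b if a > b)
    less = sum(1 for a in group_a for b in group_b if a < b)
    return (greater - less) / (len(group_a) * len(group_b))


# Hypothetical scores for illustration only.
healthy_scores = [1.2, 0.4, 0.9, -0.1]
nonhealthy_scores = [-0.8, 0.5, -1.1]

delta = cliffs_delta(healthy_scores, nonhealthy_scores)
print(round(delta, 3))  # 10 of 12 pairs favor healthy, 2 favor non-healthy: 0.667
```

The statistic ranges from –1 to 1, with 0 indicating complete overlap between the two score distributions. The pairwise loop is quadratic in sample count, which is fine for a sketch but would usually be replaced by a rank-based computation on thousands of samples.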

Fig. 3: Enhanced classification of healthy and non-healthy stool metagenomes using Gut Microbiome Wellness Index 2 (GMWI2).

a GMWI2 best stratifies healthy (n = 5547) and non-healthy (n = 2522) groups compared to GMWI and α-diversity indices (P-values from the two-sided Mann–Whitney U test; d, Cliff’s Delta effect size). Balanced accuracies on the training set are shown for GMWI2 and GMWI. b The healthy group (blue, far left) exhibits significantly higher GMWI2 scores than all 11 non-healthy phenotypes (P-values from the two-sided Mann–Whitney U test). Non-healthy phenotypes include multiple sclerosis (MS, n = 24), ankylosing spondylitis (AS, n = 95), rheumatoid arthritis (RA, n = 151), ulcerative colitis (UC, n = 250), nonalcoholic fatty liver disease (NAFLD, n = 86), type 2 diabetes (T2D, n = 377), Crohn’s disease (CD, n = 284), Graves’ disease (GD, n = 100), colorectal cancer (CC, n = 789), liver cirrhosis (LC, n = 152), and atherosclerotic cardiovascular disease (ACVD, n = 214). c Bins of GMWI2 and GMWI scores (x-axis). The heights of the black and gray bars indicate metagenome sample counts in each GMWI2 and GMWI bin, respectively (y-axis, left). Points represent the proportion of samples in each GMWI2 or GMWI bin corresponding to actual healthy and non-healthy individuals (y-axis, right). d Increased magnitude cutoffs result in improved classification performance of GMWI2, showing increasing training set balanced accuracy (blue, y-axis, left) at the expense of decreasing retained samples (orange, y-axis, right). e Classification performances of GMWI and GMWI2 in distinguishing healthy and non-healthy groups. Accuracies (y-axis, left) are depicted for both groups on the training set, leave-one-out cross-validation (LOOCV), and 10-fold CV, using varying magnitude cutoffs (0, 0.5, 1.0) of GMWI and GMWI2 scores. Balanced accuracies are shown between the blue and pink bars, which represent healthy and non-healthy groups, respectively. Orange points represent the proportion of retained samples (y-axis, right) for the corresponding index magnitude cutoff. For 10-fold CV, repeated random sub-sampling was performed ten times, and the average results are displayed. Standard box-and-whisker plots (i.e., center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers) are used to depict groups of numerical data in (a, b).

We subsequently explored whether higher (or more positive) GMWI2 values could indicate enhanced confidence in categorizing stool metagenomes as healthy. Conversely, we examined if lower (or more negative) GMWI2 scores suggest an increased likelihood that a sample could be classified as non-healthy. Indeed, we observed a progressive increase in the proportion of healthy individuals among metagenome samples with increasingly positive GMWI2 scores (Fig. 3c and Supplementary Table 2). Similarly, increasingly negative GMWI2 scores captured larger proportions of the non-healthy subjects. Notably, the proportions of actual healthy and non-healthy samples within the positive and negative bins of GMWI2, respectively, were both higher compared to the same GMWI bins (refer to points in Fig. 3c). This difference in sample distributions between the GMWI2 and GMWI bins underscores GMWI2’s improved capability to differentiate between healthy and non-healthy samples.

Figure 3c reveals a clear trend: as GMWI2 (and GMWI) scores become more positive or more negative, the proportion of actual healthy or non-healthy samples, respectively, increases. This trend suggests greater confidence in phenotype classification at larger score magnitudes. In contrast, as scores near zero, our confidence in accurately determining the presence or absence of a disease decreases. To examine this point more closely, we next investigated how setting a minimum GMWI2 magnitude threshold (cutoff) could enhance classification accuracy for phenotype prediction. We observed marked improvement in classification performance with increasing cutoffs for the magnitude of GMWI2 scores, signifying higher prediction confidence in the retained samples (Supplementary Table 3). For example, when retaining samples with GMWI2 magnitudes equal to or higher than 0.5 (i.e., GMWI2 scores below –0.5 or above +0.5) and 1.0 (i.e., GMWI2 scores below –1.0 or above +1.0), we achieved balanced accuracies of 85.8% and 91.0%, respectively (Fig. 3d). (These cutoffs are examples to illustrate the concept of the GMWI2 magnitude cutoff.) This approach, however, requires excluding samples with GMWI2 magnitudes below these cutoffs, leaving only 6364 (78.9% of the total 8069 samples) and 4712 (58.4% of 8069) samples, respectively. This highlights a significant trade-off: increasing the cutoff improves accuracy but excludes potentially valuable samples from the analysis.
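The trade-off between accuracy and sample retention can be sketched as a small function that filters samples by score magnitude before computing balanced accuracy. The function name and the eight toy samples below are hypothetical; only the rule itself (retain samples whose score magnitude is at or above the cutoff, predict healthy for positive scores) follows the text.

```python
def evaluate_with_cutoff(scores, is_healthy, cutoff):
    """Return (balanced accuracy, retained fraction) for samples with |score| >= cutoff.

    A positive score predicts healthy; a negative score predicts non-healthy.
    """
    retained = [(s, h) for s, h in zip(scores, is_healthy) if abs(s) >= cutoff]
    healthy = [s for s, h in retained if h]
    nonhealthy = [s for s, h in retained if not h]
    if not healthy or not nonhealthy:
        raise ValueError("cutoff removed every sample from one group")
    healthy_rate = sum(s > 0 for s in healthy) / len(healthy)
    nonhealthy_rate = sum(s < 0 for s in nonhealthy) / len(nonhealthy)
    return (healthy_rate + nonhealthy_rate) / 2, len(retained) / len(scores)


# Hypothetical scores and labels for illustration only.
scores = [1.4, 0.3, -0.2, 0.8, -1.6, 0.4, -0.7, -1.2]
is_healthy = [True, True, True, True, False, False, False, False]

for cutoff in (0.0, 0.5):
    accuracy, kept = evaluate_with_cutoff(scores, is_healthy, cutoff)
    print(cutoff, accuracy, kept)
```

In this toy data, raising the cutoff from 0 to 0.5 removes the three low-magnitude samples, two of which were misclassified, so accuracy rises while the retained fraction falls from 1.0 to 0.625.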

An important observation is that GMWI2 correctly classified healthy and non-healthy stool metagenomes at nearly the same rate (79.2% and 80.6%, respectively) despite imbalanced sample numbers. This contrasts markedly with the original GMWI, which achieved a much higher correct classification rate on healthy samples (Fig. 3e). We also assessed the performance of the GMWI2 model utilizing both leave-one-out cross-validation (LOOCV) and 10-fold cross-validation (10-fold CV) (Fig. 3e). Interestingly, GMWI2 achieved nearly identical balanced accuracies of 79.1% (healthy correct classification rate: 78.6%, non-healthy correct classification rate: 79.5%) and 79.0% (healthy correct classification rate: 78.6%, non-healthy correct classification rate: 79.3%) in LOOCV and 10-fold CV, respectively, nearly matching the performance achieved on the training dataset (79.9%).

Next, we computed classification accuracies using different magnitude cutoffs for the two cross-validation methods (Fig. 3e). Remarkably, GMWI2 achieved a balanced accuracy of 90.4% and 90.2% in LOOCV and 10-fold CV, respectively, on the samples with scores below –1.0 or above +1.0. These balanced accuracies were very close to those observed in the training set (91.0%). In contrast, when applying the same criteria to GMWI (i.e., cutoff of 1.0), the balanced accuracy drops considerably to 78.6%. In all, these results emphasize the notable improvements achieved with GMWI2 over GMWI.

Evaluating the robustness of GMWI2 across study populations of varying sample sizes

Although studies with small sample sizes were excluded from the training set (see study exclusion criteria in Fig. 1a and “Methods” section), in general, it is crucial to validate any classification model on datasets of varying sample sizes19. To this end, we conducted inter-study validation (ISV) to assess the impact of batch effects (i.e., technical or biological variations associated with the study population or site characteristics) on GMWI2 performance stability. In this approach, we iteratively excluded a single study, trained the GMWI2 model on the remaining studies, and evaluated its classification performance on the held-out study22. (The excluded study essentially becomes the independent validation [or test] cohort.) An important aspect of ISV is that it can showcase the significant variability in classification performance that can arise depending on the choice of validation set. For our study, it provides a range of classification accuracies achievable when applying GMWI2 across 54 independent validation sets.
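The inter-study validation scheme amounts to a leave-one-group-out split in which the grouping variable is the study of origin. A minimal sketch of the splitting step is shown below; the function name and study labels are hypothetical, and model fitting on each training partition is left out.

```python
def leave_one_study_out(study_ids):
    """Yield (held_out_study, train_indices, test_indices) for each study."""
    for held_out in sorted(set(study_ids)):
        train = [i for i, study in enumerate(study_ids) if study != held_out]
        test = [i for i, study in enumerate(study_ids) if study == held_out]
        yield held_out, train, test


# Hypothetical study labels, one per sample.
study_ids = ["study_A", "study_A", "study_B", "study_C", "study_C", "study_C"]
splits = list(leave_one_study_out(study_ids))

for held_out, train, test in splits:
    print(held_out, "train:", train, "test:", test)
```

Because no sample from the held-out study ever appears in its own training partition, study-specific technical signals cannot leak into the evaluation, which is the property that distinguishes this scheme from ordinary k-fold splitting.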

Figure 4a specifically displays the performance of GMWI2 across the full range of held-out studies, along with details on their sample sizes. Despite the variation in classification performance across different studies (see gold points indicating ISV classification accuracy per study in Fig. 4a and Supplementary Table 4), the average balanced accuracy was 75.8%. This performance rose to 86.9% when considering samples with GMWI2 scores lower than –1 or higher than 1 (Supplementary Table 4). In all, our analysis revealed no discernible correlation between the model’s predictive performance and the sample size of the held-out datasets.

Fig. 4: Inter-study validation (ISV) shows effective generalization of GMWI2 across diverse study populations.

a Classification accuracy on each excluded study in ISV is displayed by gold points (y-axis, right). The studies on the x-axis are rank-ordered based on either accuracy for a single phenotype (healthy or non-healthy) or balanced accuracy in the case of both phenotypes. The stacked bars illustrate the number of healthy (blue) and non-healthy (pink) stool metagenome samples in each study (y-axis, left). b Receiver operating characteristic curves for classification performance in distinguishing healthy and non-healthy phenotypes on the training set, 10-fold CV, and ISV.

The classification performances obtained from ISV exhibited minimal disparity compared to the performances achieved by LOOCV and 10-fold CV, which do not consider study boundaries. The small discrepancy between these strategies shows GMWI2’s resilience against batch-related biases, indicating that GMWI2 generalizes effectively across stool metagenomes, regardless of the subjects’ origins. Further evidence of this robustness is demonstrated by the area-under-the-curve (AUC) metrics in the training set, 10-fold CV, and ISV, achieving AUCs of 0.88, 0.87, and 0.84, respectively (Fig. 4b).
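The AUC can be read as the probability that a randomly chosen healthy sample scores higher than a randomly chosen non-healthy one, with ties counted as one half. Under that definition it is linked to Cliff’s Delta by d = 2 × AUC – 1. The sketch below uses hypothetical scores; as a rough consistency check, a training AUC of 0.88 would correspond to d of about 0.76, close to the reported training-set Cliff’s Delta of 0.75 given rounding.

```python
def roc_auc(positive_scores, negative_scores):
    """AUC as P(pos > neg) + 0.5 * P(pos == neg) over all cross-group pairs."""
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical scores for illustration only; "positive" here means healthy.
healthy_scores = [1.2, 0.4, 0.9, -0.1]
nonhealthy_scores = [-0.8, 0.5, -1.1]

auc = roc_auc(healthy_scores, nonhealthy_scores)
print(round(auc, 3), round(2 * auc - 1, 3))  # 0.833 and the matching delta 0.667
```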

Demonstration of GMWI2 predictive capability on independent sample sets

To confirm GMWI2’s predictive capability for distinguishing between healthy and non-healthy individuals, we compiled an external validation dataset consisting of 1140 stool metagenome samples from six published studies (Supplementary Data 6). This dataset includes samples from healthy individuals and patients diagnosed with ankylosing spondylitis, pancreatic cancer, or Parkinson’s disease. All metagenome samples in this validation dataset (Supplementary Data 7) were classified into either healthy or non-healthy groups in the same manner as demonstrated above.

Consistent with our findings from the discovery cohort (or training data), GMWI2 scores from stool metagenomes of the healthy validation group (n = 494) were significantly higher than those of the non-healthy validation group (n = 646) (P = 1.6 × 10⁻⁴³, two-sided Mann–Whitney U test; Cliff’s Delta = 0.48; Fig. 5a). The balanced accuracy achieved was 72.1%, which is comparable to the average balanced accuracy of 75.8% observed in our ISV analysis. With magnitude cutoffs of 0.5 and 1.0, the balanced accuracy improved to 75.4% and 80.1%, respectively, while still retaining 74.3% and 49.3% of the samples.

Fig. 5: GMWI2 performance on healthy and non-healthy external validation cohorts.

a GMWI2 scores from healthy (494 samples) and non-healthy (646 samples) groups. Scores are significantly higher in the healthy group compared to the non-healthy group (P = 1.6 × 10⁻⁴³; two-sided Mann–Whitney U test). The effect size is represented by Cliff’s Delta (d = 0.48). The balanced accuracy of the classification is 72.1%. b GMWI2 scores across five healthy (H1–H5) and three non-healthy cohorts (AS4 ankylosing spondylitis, PD6 Parkinson’s disease, PC5 pancreatic cancer). The superscript numbers adjacent to phenotype abbreviations correspond to specific studies detailed in Supplementary Data 6. Asterisk (*) indicates a significantly higher score in a healthy cohort compared to the corresponding non-healthy cohort (P < 0.01, two-sided Mann–Whitney U test; exact P-values are provided in Supplementary Data 6). Numbers next to each asterisk refer to the healthy cohort compared against each non-healthy condition. Sample sizes of each group or cohort are shown in parentheses. Standard box-and-whisker plots (i.e., center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers in (a) or individual GMWI2 scores in (b)) are used to depict groups of numerical data.

To further examine GMWI2 performance on the external validation data, we analyzed the eight total cohorts (defined by unique phenotype per study), spanning five healthy and three non-healthy phenotypes. As shown in Fig. 5b, four of the five healthy cohorts (H1–H4) were found to have significantly higher GMWI2 distributions than all three non-healthy phenotype cohorts (P < 0.01, two-sided Mann–Whitney U test). Classification accuracies for the five healthy cohorts were as follows: 96.3% (130 of 135) for H1, 91.2% (52 of 57) for H2, 83.3% (25 of 30) for H3, 56.8% (21 of 37) for H4, and 28.1% (66 of 235) for H5. Alternatively, classification accuracies for the three non-healthy cohorts were 90.7% (39 of 43) for pancreatic cancer (PC5), 81.2% (398 of 490) for Parkinson’s disease (PD6), and 80.5% (91 of 113) for ankylosing spondylitis (AS4). Notably, GMWI2 performed well (81.2%) in predicting adverse health in Parkinson’s disease, although stool metagenomes from patients with this neurodegenerative disorder were not part of the original discovery set. Furthermore, despite the relatively poor classification performance in the H5 cohort (28.1%), the GMWI2 scores in H5 were significantly higher than those in the PC5 pancreatic cancer group from the same study. Overall, the robust reproducibility of GMWI2 on an external validation dataset suggests that a generalized disease-associated signature of gut microbiome dysbiosis across multiple diseases was effectively captured during dataset integration and index formulation.

Gut health tracking in longitudinal studies

We applied GMWI2 to stool metagenomes obtained from four recently published longitudinal gut microbiome studies. Importantly, these samples were not part of the initial pool of 8069 metagenomes used to train GMWI2. Here, our aim was to illustrate GMWI2’s versatility by applying it to gut microbiome health tracking, thereby extending its applicability beyond the originally intended case vs. control scenarios. Using our index to quantitatively monitor gut health can be likened to using cholesterol and glucose tests to evaluate cardiovascular and metabolic health over time.

Using data from the first study23, we analyzed stool metagenomes from 22 individuals with irritable bowel syndrome (IBS) before and six months after receiving fecal microbiota transplantation (FMT) from two healthy donors. Among the participants, 14 reported symptom relief after FMT (“Effect” group), while 8 did not experience symptom relief (“No Effect” group) despite both groups demonstrating a significant increase in species richness at six months following FMT (P < 0.05, one-sided Wilcoxon signed-rank test; Supplementary Fig. 3). However, only the individuals in the “Effect” group exhibited a significant increase in GMWI2 (P < 0.05; Fig. 6a and Supplementary Table 5). Likewise, an increase in the species-level Shannon Index was observed only in the “Effect” group (P < 0.05; Supplementary Fig. 4). Overall, these findings suggest that while α-diversity metrics, such as richness and Shannon diversity, may yield conflicting conclusions, changes in GMWI2 could serve as a marker of subjects’ phenotypes following FMT treatment for IBS. Furthermore, in light of the clinical significance and the complexities involved in donor screening for FMT24,25, computational tools such as GMWI2 (given its more nuanced definition of gut health) may be able to help guide the selection of suitable healthy donors and their stool samples.
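The paired before/after comparison used here is a one-sided Wilcoxon signed-rank test. The sketch below computes an exact p-value by enumerating all sign patterns, which is practical only for small cohorts such as the ones in this study. It is an illustration on hypothetical paired scores, and whether the original analysis used an exact or approximate p-value is not stated in the text; in practice a library routine such as SciPy’s `scipy.stats.wilcoxon` with `alternative="greater"` would be the usual choice.

```python
from itertools import product


def wilcoxon_signed_rank_greater(before, after):
    """Exact one-sided p-value for the hypothesis that 'after' exceeds 'before'."""
    diffs = [a - b for a, b in zip(after, before) if a != b]  # drop zero differences
    order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * len(diffs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1  # tied magnitudes share the mean rank
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    observed = sum(r for r, d in zip(ranks, diffs) if d > 0)
    # Under the null hypothesis every sign pattern is equally likely.
    at_least_as_extreme = sum(
        1
        for signs in product((0, 1), repeat=len(ranks))
        if sum(r for r, s in zip(ranks, signs) if s) >= observed
    )
    return at_least_as_extreme / 2 ** len(ranks)


# Hypothetical paired scores (baseline vs. follow-up) for five subjects.
before = [0.1, -0.3, 0.2, -0.5, 0.0]
after = [0.4, 0.1, 0.9, 0.0, 0.6]
p_value = wilcoxon_signed_rank_greater(before, after)
print(p_value)  # all five differences are positive, so p = 1/32 = 0.03125
```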

Fig. 6: Reanalysis of existing longitudinal gut microbiome studies with GMWI2.

a Changes in GMWI2 in patients with irritable bowel syndrome observed six months (6-mo) after undergoing fecal microbiota transplantation. Only subjects experiencing symptom relief (“Effect” group) displayed a significant increase in GMWI2 (P = 0.039, one-sided Wilcoxon signed-rank test). n, number of FMT donor samples (17 total samples from two healthy donors) or number of FMT recipients. b GMWI2 scores for dietary groups (EEN, Vegan, and Omnivore) at baseline and at the first 5–6 days of dietary intervention. The EEN group showed significant changes in GMWI2, with values significantly decreased by day 2 and thereafter (P < 0.05, two-sided Wilcoxon signed-rank test). No significant change in GMWI2 was observed for the Omnivore and Vegan groups compared to baseline. n, number of unique individuals who each provided a stool sample per time point. c GMWI2, Shannon Index, and species richness before and after antibiotic intervention. Despite recovery in Shannon Index and species richness at day 42 and day 180, respectively, GMWI2 remained significantly lower compared to day 0, suggesting incomplete gut microbiome recovery even after ~6 months (P < 0.05, two-sided Wilcoxon signed-rank test). n, number of unique individuals who each provided a stool sample per time point. d GMWI2 of gut microbial communities after 24-h in vitro fecal fermentation with five different prebiotic oligosaccharides. The experiment was conducted in triplicates for each study group. The height of the bars represents the mean GMWI2 (numbers inside the solid bars), and error bars indicate the standard deviation from the mean. Points represent individual triplicate samples. Different small letters above the bars denote groups with significant differences in GMWI2 as determined by Tukey’s HSD test (P < 0.05). Control groups: NS0, no substrate addition at 0 h; NS24, no substrate for 24 h. 
Prebiotic groups: FS24 fructooligosaccharide, IN24 inulin, GS24 galactooligosaccharide, XS24 xylooligosaccharide, FL24 2’-fucosyllactose. Standard box-and-whisker plots (i.e., center line, median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, individual GMWI2 scores or α-diversity values) are used to depict groups of numerical data in (a–c).

In the second study26, we investigated the effects of diet. We calculated GMWI2 for stool metagenomes obtained from 30 healthy volunteers before and during a dietary intervention. Three groups of participants were studied: Vegan (self-reported vegans who resumed their regular diet), Omnivore (participants who consumed a standard diet of both animal and plant origin), and Exclusive Enteral Nutrition (EEN) (participants with an omnivorous diet who went on to consume a synthetic, fiber-free diet for the duration of the study). Stool samples were collected at baseline and each day during the dietary intervention. We observed that the GMWI2 scores for both the vegan and omnivore subjects remained relatively stable throughout the intervention period of five to six days (Fig. 6b). However, GMWI2 for the EEN group significantly decreased relative to baseline by the second day and onwards (P < 0.05, two-sided Wilcoxon signed-rank test; Fig. 6b and Supplementary Table 6) while α-diversities did not significantly change across the groups (Supplementary Fig. 5). These results suggest that the removal of dietary fiber may lead to a rapid decrease in overall gut health, an early change detected solely by GMWI2 and not by α-diversity metrics. Overall, our findings strengthen the evidence for the well-established benefits of dietary fiber on health27,28,29.

For the third study30, we calculated GMWI2 for stool metagenomes from twelve healthy young adults who underwent a 4-day exposure to broad-spectrum antibiotics (meropenem, gentamicin, and vancomycin). Here, stool samples were collected before the exposure, and then again at 4, 8, 42, and 180 days post-intervention. While species-level α-diversity measures (Shannon Index and richness) indicated that the gut microbiome may have recovered somewhat by day 42 or 180, GMWI2 did not demonstrate any recovery trend even by day 180 (Fig. 6c and Supplementary Table 7). These findings reflect deleterious post-intervention taxonomic shifts originally noted by Palleja et al., such as the rise in previously undetectable Clostridium spp., and the disappearance of probiotic members of Bifidobacterium and butyrate producers Coprococcus eutactus and Eubacterium ventriosum. Our results therefore offer a novel perspective on the long-term impact of short-term broad-spectrum antibiotic intervention on gut microbiota and suggest that GMWI2 could be a valuable tool for assessing gut microbiome recovery following an acute illness.

In the final study31, we examined the effect of various oligosaccharides on gut microbial communities. In this study, Lee et al. used GMWI to assess the prebiotic effect of oligosaccharides, with broader implications for designing personalized diets based on their impact on gut microbiome wellness. Herein, 19 healthy adult volunteers (14 men and 5 women) provided fecal samples, which were then combined and well-mixed. Then, fructooligosaccharides (FOS), galactooligosaccharides (GOS), xylooligosaccharides (XOS), inulin (IN), and 2′-fucosyllactose (2FL) were separately mixed with portions of the homogenized fecal samples in a 24-h in vitro anaerobic batch fecal fermentation system. Two control groups were also included: one without substrate addition at 0 h (NS0) and another without substrate addition for 24 h (NS24). The experiment was conducted in triplicate for each of the seven study groups.

GMWI2 was calculated for all fecal samples (Fig. 6d and Supplementary Table 8), thereby replicating the original study with our new index. Consistent with previous findings, the NS24 group exhibited a lower average GMWI2 than the NS0 group, indicating a less healthy and more disease-associated state. Notably, the addition of the three prebiotics (FOS, IN, and GOS) resulted in significantly higher GMWI2 compared to NS0 (P < 0.05, Tukey’s HSD test). Also, these same three prebiotics, along with XOS, led to significantly higher GMWI2 relative to NS24 (P < 0.05). However, unlike the GMWI2 results, traditional α-diversity metrics (Shannon Index, species richness, species evenness, and inverse Simpson’s Index) were reported to have significantly lower values in all prebiotic treatment groups compared to the NS0 group (P < 0.05)31. Therefore, at least in the in vitro fermentation setting, intake of these four prebiotics could potentially stimulate the growth of gut microbial species associated with healthy conditions, an effect observed solely by using GMWI2.

Discussion

Recent research into the human gut microbiome has highlighted its potential to inform the development of innovative tools for predictive healthcare32,33,34,35,36,37. In this regard, we introduce GMWI2, a robust predictor of health status based on gut microbiome taxonomic profiles that displays significant technological advances compared to its prototype (GMWI). Our extensive multi-study analysis, pooling 8069 stool shotgun metagenomes from 54 published studies, encompasses a diverse range of demographics from 26 countries across six continents to identify the biological signals linking gut taxonomies to human health. Delivering a cross-validation balanced accuracy of approximately 90% for higher confidence samples, GMWI2 establishes its strong reliability as a classifier that distinguishes between healthy and non-healthy phenotypes. Furthermore, by revisiting and reinterpreting data from previously published datasets, GMWI2 can offer novel perspectives even for the established understanding of the impact of dietary influences, antibiotic exposure, and FMT on the gut microbiome. Lastly, this study highlights the importance of extensive data sharing in fostering robust machine learning applications, and in demonstrating resilience to batch effects and biases22,38,39,40.

In our analyses in which we incrementally increased the GMWI2 magnitude cutoff, we recognize an inverse relationship between classification accuracy and the volume of samples eligible for class prediction. Therefore, constraining this magnitude cutoff to a single value may not be universally applicable; instead, the selection of this parameter should be flexible and determined by the user, tailored to the specific context and acceptable accuracy thresholds of their individual datasets. In other words, users can select their desired GMWI2 magnitude cutoff based on their confidence level preference in the predictions. This user-driven approach, which offers flexibility between high confidence in a limited dataset and broader range predictions with lesser confidence, is a distinct advantage of our method over traditional binary-output machine learning techniques. Moreover, our findings thus foster the potential utility of a “reject option”41,42 for low GMWI2 magnitudes, which can serve as a criterion to redirect relatively uncertain predictions to other screening methods—this concept captures the understanding that certain aspects of health and disease are not fully explainable solely by the gut microbiome.

Our study, while providing insights into the predictive capabilities of the gut microbiome, has some limitations that need to be acknowledged. First and foremost, we emphasize that GMWI2 scores reflect an association with health status, which we define in terms of the presence or absence of disease. It is important to understand that these scores do not imply a causal relationship with (nor are they intended to replace) direct clinical health measures, such as the detection of pathogenic organisms in the gastrointestinal tract, gut motility characteristics, metabolic profiles, serological markers, blood inflammatory markers, or fecal calprotectin levels. Second, the model could benefit from the inclusion of more intricate microbiome features such as species growth rates, strain details, and functional potential. Incorporating these important factors may improve predictive accuracy and offer a richer perspective on the intricate mechanisms tying the gut microbiome to overall human health. Third, we made concerted efforts to ensure that our pooled stool metagenomic dataset exhibits a diverse representation of geographies, races, and cultures. Nevertheless, future work should emphasize wider participant inclusion, especially from underrepresented areas and ethnicities, to truly globalize gut microbiome research. Additionally, loosening our selection criteria will allow us to incorporate metagenomes from a broader range of disease phenotypes (like neurodegenerative and psychiatric disorders) and reach even more diverse demographics. Such expansion could enhance the model’s generalizability across different populations. Fourth, although we utilized taxonomic information down to the species level, there’s a potential missed opportunity in not focusing on microbial strains, which often bear more clinical significance. 
While our method surpasses the genus-level limitations of 16S rRNA gene amplicon sequencing, it doesn’t account for the variability among strains of the same species. Fifth, our analysis revealed that well-known pathogens, including Enterococcus faecium/faecalis, did not display negative coefficients in our GMWI2 framework. Nevertheless, we did observe negative coefficients for certain opportunistic pathogenic taxa, notably among various Clostridium species, as detailed in Supplementary Data 4. It is important to emphasize that the determination of pathogenic traits is more accurately conducted at the strain level, which falls outside the scope of our model. Additionally, it is widely acknowledged that not every gut microbiome associated with chronic, non-communicable diseases necessarily harbors invasive pathogens. Sixth, we recognize that the compositional shifts between healthy and non-healthy identified by our model might be influenced by variables such as transit time, stool consistency, and other factors not captured in our meta-data. This is a valid consideration for individual samples. However, in our analysis of over 8000 metagenome samples, our assumption is that such variables are likely to be evenly (randomly) distributed or have minimal impact on the overall performance of the GMWI2 tool, given the breadth and reasonable diversity inherent in our study’s sample population. Last, our definitions of healthy (i.e., self-reported absence of a disease or disease-related symptoms) and non-healthy (i.e., patients with a clinical diagnosis of a disease) are consistent with those used in our previous studies10,11, as the current work represents a continuous refinement of our previous method. However, we have not investigated how subtle variations in these definitions may impact GMWI2 classification accuracy. Analyzing this aspect is a potential area for future research.

In regard to its translational potential, GMWI2 is designed to offer a novel method for dynamically monitoring an individual’s health in a semi-real-time manner through the analysis of gut microbiome taxonomic profiles. While our index is explicitly trained to distinguish between healthy and diseased gut microbiomes, it also provides a practical approach to approximating pre-diseased states. This is achieved by interpolating between the healthy and diseased states, allowing GMWI2 to reveal variations across the gut microbiome health spectrum. Specifically, assuming sufficient prediction quality of our model, an individual’s GMWI2 score will decrease as they transition from healthy to pre-diseased to diseased states, or increase if transitioning in the reverse direction. Moreover, GMWI2 provides a pragmatic alternative to the resource-intensive collection of longitudinal gut microbiome datasets needed to precisely track the steady transition from healthy to diseased. Current efforts in this area are very limited in scale and costly.

In all, GMWI2 is not intended for confirming specific disease diagnoses but rather serves as an early warning system, akin to a “canary in a coal mine”. It is designed to detect potentially adverse shifts in overall gut health before specific, diagnosable symptoms occur. Such detection could inform dietary or lifestyle modifications to prevent mild issues from escalating into severe health conditions, or prompt further diagnostic tests. Unlike existing disease-specific indices, our index spans multiple diseases, thereby emphasizing a pan-disease (or alternatively, a generally healthy) gut microbiome signature. This broad applicability could be particularly useful in clinical scenarios such as selecting FMT donors, where gut health could be taken as a reflection of overall health. In conditions like rheumatoid arthritis and other autoimmune inflammatory disorders, GMWI2 could guide decisions on tapering or discontinuing therapy, or assessing the possibility of disease flares. In this sense, GMWI2 may potentially usher in a transformative era in gut microbiome-centric health analytics, allowing for nuanced health evaluations tailored to individual microbial signatures. Looking ahead, integrating GMWI2 into a larger decision network alongside other biomeasurements (e.g., multi-omics, wearables) and AI models has the potential to open exciting possibilities for healthy aging43 and preventative health screening and wellness programs44,45, driven by insights from the gut microbiome.

Methods

Multi-study pooling of human stool metagenomes

We conducted a comprehensive literature search using targeted keywords such as “gut microbiome”, “stool metagenome”, and “whole-genome shotgun” in PubMed and Google Scholar. The search was performed up until January 2022 to identify published studies that included publicly available shotgun metagenomic data of human stool samples, along with corresponding subject meta-data. In cases where multiple samples were collected from individuals across different time points, we included only the first or baseline sample from that study subject. Studies involving dietary or medication interventions were not included in the pooled dataset for GMWI2 training. Studies with fewer than 40 samples were also excluded from our analysis, considering the potential limitations in the robustness and reliability of microbiome data from such pilot-scale microbiome studies. The raw sequence files (in .sra or .fastq format) were retrieved from the NCBI Sequence Read Archive and European Nucleotide Archive databases for further analysis.

Stool metagenome sample exclusion criteria

To minimize potential bias and preserve data integrity, we applied stringent criteria to the stool shotgun metagenome samples for inclusion in our study. Specifically, we excluded samples sequenced using non-Illumina platforms, such as 454 GS FLX Titanium, Ion Torrent PGM, Ion Torrent Proton, and BGISEQ-500, to ensure consistency in sequencing technology. In terms of data quality, we excluded samples with low read counts (below 1 million reads) prior to quality control filtration. Additionally, our analysis did not include samples from studies with a primary focus on the virome or those where stool samples underwent virus-like particle purification.

Furthering our strict sample control standards, we also excluded disease control samples that were not specifically tied to a clinical diagnosis in the originating study. Individuals who were not clinically diagnosed with a specific disease but exhibited certain anomalous conditions were also excluded rather than being classified as a non-healthy phenotype. These conditions comprised: (i) a Body Mass Index (BMI) suggestive of being underweight (BMI < 18.5), overweight (BMI ≥ 25 and <30), or obese (BMI ≥ 30); (ii) declared heavy drug use (including alcohol and recreational drugs); (iii) age exceeding 100 years; and (iv) being healthy at baseline but later reported to develop a disease condition during a longitudinal study. Additionally, samples from newborn, infant, and child gut microbiome studies were excluded since the primary focus was on adult human gut microbiomes. Lastly, we excluded non-healthy individuals with early-stage diseases (e.g., impaired glucose tolerance, hypertension, colorectal adenoma), rare or genetically-linked disorders (e.g., Behcet’s disease, schizophrenia), and non-colon cancers (including pancreatic, non-small cell lung, and breast cancer). These exclusions were applied to ensure a uniform and representative dataset for training GMWI2.

Quality control of sequenced reads

Potential human contamination was filtered out by removing reads that aligned to the human genome (reference genome GRCh38/hg38) using Bowtie246 v2.4.4 with default parameters. Along with Illumina universal adapter sequences, probable adapter sequences were identified by extracting overrepresented sequences from each metagenome sample using FastQC47 v0.11.8. Adapter sequence clipping and quality filtration were performed using Trimmomatic48 v0.39. Specifically, Trimmomatic’s “ILLUMINACLIP” step was applied with a maximum seed mismatch count of 2, palindrome clip threshold of 30, simple clip threshold of 10, and minimum adapter length of 2 bp. Additionally, leading and trailing low-quality bases (Phred quality score < 3) of each read were removed, and trimmed reads shorter than 60 bp in nucleotide length were discarded.

Taxonomic profiling

After performing quality filtration on all raw metagenomes, taxonomic profiling was carried out using the MetaPhlAn316 v3.0.13 phylogenetic clade identification pipeline using default parameters. Briefly, MetaPhlAn3 classifies metagenomic reads to taxonomies based on a database (mpa_v30_CHOCOPhlAn_201901) of clade-specific marker genes. Once taxonomic features (or clades) of unknown/unclassified identity were removed, the remaining clades that could be detected in at least one metagenome sample in the pooled dataset were considered for further analysis.

After taxonomic profiling, the following metagenomes were discarded from our analysis: (i) samples composed of >90% unmapped reads; (ii) samples with a relatively high proportion (>25%) of unknown taxa; and (iii) samples lacking sufficient taxonomic diversity (<100 identified taxa). These samples were removed to maintain the quality and reliability of the training data. Finally, after applying all exclusion criteria, studies with fewer than 20 remaining samples were removed.
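The post-profiling exclusion rules can be sketched in Python as follows. This is an illustration only, not the pipeline used in this study: the field names (`study`, `unmapped_fraction`, `unknown_fraction`, `n_taxa`) and both function names are hypothetical, and we assume that each quantity has already been computed per sample.

```python
from collections import Counter

def passes_profile_qc(sample):
    """Return True if a sample survives the three post-profiling filters."""
    return (
        sample["unmapped_fraction"] <= 0.90      # (i) discard >90% unmapped reads
        and sample["unknown_fraction"] <= 0.25   # (ii) discard >25% unknown taxa
        and sample["n_taxa"] >= 100              # (iii) discard <100 identified taxa
    )

def filter_samples(samples, min_per_study=20):
    """Apply per-sample filters, then drop studies left with too few samples."""
    kept = [s for s in samples if passes_profile_qc(s)]
    per_study = Counter(s["study"] for s in kept)
    return [s for s in kept if per_study[s["study"]] >= min_per_study]

# Toy records (hypothetical values, for illustration only).
good = {"unmapped_fraction": 0.40, "unknown_fraction": 0.05, "n_taxa": 250}
bad = {"unmapped_fraction": 0.95, "unknown_fraction": 0.05, "n_taxa": 250}
samples = (
    [dict(good, study="A") for _ in range(25)]    # study A: 25 passing samples
    + [dict(good, study="B") for _ in range(19)]  # study B: 19 passing samples
    + [dict(bad, study="B") for _ in range(5)]    # study B: 5 failing samples
)
retained = filter_samples(samples)
```

In this toy input, study B falls below the 20-sample minimum once its failing samples are removed, so only the samples from study A are retained.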

Generating presence/absence taxonomic profiles

To mitigate concerns related to the compositional nature of microbiome data49 and batch effects, and to simplify the interpretation of the GMWI2 classification model, we transformed the taxa relative abundances from MetaPhlAn3 into a binary presence/absence profile for each metagenome sample. Specifically, a taxon was deemed “present” in a given sample if its relative abundance was equal to or greater than 0.00001 (or 0.001%), and considered absent otherwise. Consequently, each sample was represented as a binary vector.
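A minimal sketch of this transformation is given below. It assumes that relative abundances are expressed as fractions between 0 and 1 (for abundances reported as percentages, the equivalent threshold would be 0.001); the function name, the dictionary-based profile, and the taxon labels are hypothetical.

```python
PRESENCE_THRESHOLD = 1e-5  # 0.00001 as a fraction, i.e., 0.001%

def to_presence_absence(rel_abund, taxa, threshold=PRESENCE_THRESHOLD):
    """Binarize one sample: 1 if a taxon's relative abundance >= threshold, else 0.

    rel_abund maps taxon name -> relative abundance (fraction);
    taxa fixes the feature order; taxa missing from rel_abund count as absent.
    """
    return [1 if rel_abund.get(t, 0.0) >= threshold else 0 for t in taxa]

# Toy profile (hypothetical taxon names and abundances).
taxa = ["s__taxon_A", "s__taxon_B", "s__taxon_C", "s__taxon_D"]
profile = {"s__taxon_A": 0.2, "s__taxon_B": 0.00001, "s__taxon_C": 0.000009}
x = to_presence_absence(profile, taxa)  # [1, 1, 0, 0]
```

A taxon exactly at the threshold is counted as present, in line with the "equal to or greater than" rule above.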

PCA and PERMANOVA analysis on taxonomic profiles

Principal component analysis (PCA) was conducted on the presence/absence taxonomic profiles using the “prcomp” function in R. Additionally, Bray-Curtis distance matrices were generated based on the relative abundances of microbial taxa (ranging from phylum to species) in the stool metagenomes. This was done using the “vegan” package v2.6.4 in R. We then carried out permutational multivariate analysis of variance (PERMANOVA) on the distance matrix using the “adonis2” function. To evaluate the influence of the subjects’ health status on the total variance in gut microbial community composition, we calculated the P-value for the test statistic (pseudo-F) based on 999 permutations.
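For readers unfamiliar with the test, the logic of a single-factor PERMANOVA on Bray-Curtis distances can be sketched in plain Python. This is a simplified re-implementation for illustration only; it is not the “vegan” code used in our analysis, and the function names and toy abundance profiles are hypothetical.

```python
import random
from itertools import combinations

def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two non-negative abundance vectors."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0

def pseudo_f(dist, labels):
    """Pseudo-F statistic from a square distance matrix and group labels."""
    n = len(labels)
    groups = sorted(set(labels))
    ss_total = sum(dist[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for g in groups:
        idx = [i for i, lab in enumerate(labels) if lab == g]
        ss_within += sum(dist[i][j] ** 2 for i, j in combinations(idx, 2)) / len(idx)
    ss_among = ss_total - ss_within
    return (ss_among / (len(groups) - 1)) / (ss_within / (n - len(groups)))

def permanova(profiles, labels, n_perm=999, seed=0):
    """Return (observed pseudo-F, permutation P-value)."""
    n = len(profiles)
    dist = [[bray_curtis(profiles[i], profiles[j]) for j in range(n)] for i in range(n)]
    f_obs = pseudo_f(dist, labels)
    rng = random.Random(seed)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute health-status labels across samples
        if pseudo_f(dist, shuffled) >= f_obs:
            hits += 1
    return f_obs, (hits + 1) / (n_perm + 1)

# Toy relative-abundance profiles (hypothetical), four samples per group.
profiles = [
    [0.70, 0.20, 0.10, 0.00], [0.60, 0.30, 0.10, 0.00],
    [0.65, 0.25, 0.10, 0.00], [0.68, 0.22, 0.10, 0.00],
    [0.00, 0.10, 0.20, 0.70], [0.00, 0.10, 0.30, 0.60],
    [0.00, 0.10, 0.25, 0.65], [0.00, 0.10, 0.22, 0.68],
]
labels = ["healthy"] * 4 + ["non-healthy"] * 4
f_obs, p_value = permanova(profiles, labels)
```

The P-value is the proportion of permutations whose pseudo-F is at least as large as the observed value, with the observed arrangement counted once in both numerator and denominator.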

Estimating disease likelihood using Lasso-penalized logistic regression

A Lasso-penalized logistic regression model (Python library “scikit-learn” v1.0.2) was trained on the binary presence/absence taxonomic profiles of the entire pooled dataset of 8069 metagenomes to predict disease presence. The L1 (Lasso) penalty was utilized with the LIBLINEAR solver50. The random state was set to 42, and the class weight was set to “balanced” in order to account for the unbalanced class proportions in our pooled dataset. Hyperparameter tuning—specifically the selection of the regularization parameter \(C\)—was achieved through nested cross-validation that implements the inter-study validation (ISV) framework. Herein, we evaluated various candidates and selected the value that yielded the optimal classification performance in ISV (Supplementary Table 9; see table footnote for our nested cross-validation protocol). \(C\) = 0.03 consistently emerged as the optimal hyperparameter within each outer-loop training fold and was thus selected for the final GMWI2 model.
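The model configuration described above can be sketched with scikit-learn as follows. The data are synthetic and serve only to make the example runnable; they are not the pooled dataset. Setting `fit_intercept=False` is an assumption made here so that the fitted model has the same form as the coefficient-only hypothesis function used in the equations of this section; the intercept setting of the final GMWI2 model is not restated here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for a presence/absence matrix (NOT the pooled dataset).
rng = np.random.default_rng(0)
n_samples, n_taxa = 2000, 50
X = rng.integers(0, 2, size=(n_samples, n_taxa))

# Hypothetical ground truth: taxon 0 favors healthy (y = 1),
# taxon 1 favors non-healthy (y = 0); all other taxa are noise.
true_logit = 3.0 * X[:, 0] - 3.0 * X[:, 1]
y = (rng.random(n_samples) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)

model = LogisticRegression(
    penalty="l1",             # Lasso penalty
    solver="liblinear",       # LIBLINEAR solver
    C=0.03,                   # inverse regularization strength
    class_weight="balanced",  # reweight classes by inverse frequency
    random_state=42,
    fit_intercept=False,      # assumption: coefficient-only model
)
model.fit(X, y)

theta = model.coef_.ravel()               # one coefficient per taxon
n_selected = int(np.count_nonzero(theta)) # taxa retained by the L1 penalty
```

Because the L1 penalty drives many coefficients to exactly zero, `n_selected` is typically much smaller than the number of input taxa, which is what makes the resulting index straightforward to interpret.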

Let \({{{\boldsymbol{x}}}}_{i}\) be a binary vector encoding the presence or absence of n taxa in the ith labeled sample:

$${{{\boldsymbol{x}}}}_{i}=\left[{x}_{i}^{1},\, {x}_{i}^{2},\cdots,\, {x}_{i}^{n}\right]$$
(1)

where \({x}_{i}^{j}\) is 1 if taxon \(j\) is present in sample \(i\) and 0 otherwise. Additionally, n = 3200 is the number of taxonomic features (or clades) encoded for each sample (a total of 3200 taxonomic features were observed at least once in the pooled metagenome dataset).

Let \({y}_{i}\) represent the health status (1 for healthy, 0 for non-healthy) of sample i. The subsequent log-loss optimization objective function is solved using L1 regularization and class proportion weights as follows:

$${\theta }^{*}={{{\rm{argmin}}}}_{\theta \in {{\mathbb{R}}}^{n}}\,C{\sum }_{i=1}^{m}\alpha \left({y}_{i}\right)\left[-{y}_{i}\log \left({h}_{\theta }\left({{{\boldsymbol{x}}}}_{i}\right)\right)-\left(1-{y}_{i}\right)\log \left(1-{h}_{\theta }\left({{{\boldsymbol{x}}}}_{i}\right)\right)\right]+\parallel \theta \parallel _{1}$$
(2)

where \({\theta }^{*}\) refers to the learned coefficient vector, \(C\) is the aforementioned inverse regularization strength parameter, m = 8069 represents the total number of samples in the pooled metagenome dataset, \(\alpha\) is the class proportion weight term, and \({h}_{\theta }({{{\boldsymbol{x}}}}_{i})\) is the hypothesis function:

$${h}_{\theta }\left({{{\boldsymbol{x}}}}_{i}\right)=P\left({y}_{i}=1{{\rm{|}}}{{{\boldsymbol{x}}}}_{i},\theta \right)=\sigma \left({\theta }^{T}{{{\boldsymbol{x}}}}_{i}\right)=\frac{1}{1+{e}^{-{\theta }^{T}{{{\boldsymbol{x}}}}_{i}}}$$
(3)

where \(\sigma\) is the sigmoid function. The class proportion term \(\alpha\) accounts for the relatively unbalanced class proportions in the pooled dataset:

$$\alpha \left({y}_{i}\right)=\frac{m}{2\mathop{\sum }_{j=1}^{m}\left[{y}_{i}={y}_{j}\right]}$$
(4)
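As a worked illustration of Eq. (4), the class proportion weights can be computed as follows; the function name and the toy labels are hypothetical. The same expression, the number of samples divided by the product of the number of classes and the class count, is what scikit-learn applies when the class weight is set to “balanced”.

```python
from collections import Counter

def class_proportion_weights(labels):
    """Eq. (4): alpha(y) = m / (2 * number of samples sharing label y)."""
    m = len(labels)
    counts = Counter(labels)
    return {cls: m / (2 * n_cls) for cls, n_cls in counts.items()}

# Toy labels: six healthy (1) and two non-healthy (0) samples.
labels = [1] * 6 + [0] * 2
alpha = class_proportion_weights(labels)  # alpha[1] = 8/12, alpha[0] = 8/4 = 2.0

# Total weight carried by each class after reweighting.
weighted_totals = {cls: alpha[cls] * labels.count(cls) for cls in alpha}
```

After reweighting, each class carries a total weight of m/2 (here 4.0), so the minority class contributes as much to the loss as the majority class.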

Using GMWI2 as a stool metagenome-based health status classifier

We calculated GMWI2 scores for all 8069 stool metagenomes in the pooled dataset, as well as samples from the four gut microbiome case studies. The taxonomic profile of a metagenome was represented as a vector \({{{\boldsymbol{x}}}}_{{\mbox{test}}}\), with binary values that encoded the presence or absence of microbial taxa. The computation employed the predicted log odds (logit) using the previously learned coefficient vector \({\theta }^{*}\):

$${{\mbox{GMWI}}}2\left({{{\boldsymbol{x}}}}_{{\mbox{test}}}\right)={\left({\theta }^{*}\right)}^{T}{{{\boldsymbol{x}}}}_{{\mbox{test}}}$$
(5)

For classification purposes, a predetermined magnitude cutoff parameter \(c\) was utilized (\(c=\,0\) in case of having no cutoff or defer option). Finally, GMWI2 was computed on a metagenome \({{{\boldsymbol{x}}}}_{{\mbox{test}}}\) while applying the following criteria:

$${{\mbox{classify}}}\left({{{\boldsymbol{x}}}}_{{\mbox{test}}}\right)=\left\{\begin{array}{cc}{{\mbox{non{\mbox{-}}}healthy}} & {{\mbox{GMWI}}}2\left({{{\boldsymbol{x}}}}_{{\mbox{test}}}\right) < -c \hfill \\ {{\mbox{defer}}} & -c\le {{\mbox{GMWI}}}2\left({{{\boldsymbol{x}}}}_{{\mbox{test}}}\right)\le \,c \hfill \\ {{\mbox{healthy}}} & {{\mbox{GMWI}}}2\left({{{\boldsymbol{x}}}}_{{\mbox{test}}}\right) > \,c \hfill\end{array}\right.$$
(6)

Of note, our current methodology does not inherently categorize gut microbiome samples into a third option. GMWI2 yields a continuous score, where the sign (negative or positive) is indicative of disease presence or absence, respectively; and higher magnitudes imply greater confidence in the prediction. The “defer” (or “not determined”) category is an optional feature, applicable when a user decides to implement a non-zero GMWI2 magnitude cutoff \(c\). Scores whose magnitude does not exceed this user-defined cutoff (e.g., scores between –1.0 and +1.0 for \(c=1.0\)) can be classified as “defer.”
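Eqs. (5) and (6) can be sketched together in a few lines of Python. The coefficient values below are invented for illustration and are not the learned GMWI2 coefficients; the function names are likewise hypothetical.

```python
def gmwi2_score(theta, x):
    """Eq. (5): dot product of the coefficient vector and a binary profile."""
    return sum(t * xi for t, xi in zip(theta, x))

def classify(score, c=0.0):
    """Eq. (6): three-way decision with magnitude cutoff c (c = 0 means no cutoff)."""
    if score < -c:
        return "non-healthy"
    if score > c:
        return "healthy"
    return "defer"

# Hypothetical coefficients for four taxa (NOT the learned GMWI2 values).
theta = [0.8, -1.5, 0.3, 0.0]

sample_a = [1, 0, 1, 1]  # score about  1.1
sample_b = [1, 1, 1, 0]  # score about -0.4

label_a = classify(gmwi2_score(theta, sample_a), c=1.0)      # "healthy"
label_b = classify(gmwi2_score(theta, sample_b), c=1.0)      # "defer"
label_b_no_cutoff = classify(gmwi2_score(theta, sample_b))   # "non-healthy"
```

The second sample shows the effect of the cutoff: its score is negative, so it is called non-healthy when \(c=0\), but it is deferred when \(c=1.0\) because its magnitude is too small.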

Evaluation of classification performance

Balanced accuracy, defined as the average of the proportions of correctly classified healthy and non-healthy samples, was used to evaluate the performance of the GMWI2 classification model. This was done across different cutoff parameters (c) using multiple validation techniques: training on the entire dataset and then testing on the same training set, 10-fold cross-validation (10-fold CV), and leave-one-out cross-validation (LOOCV). In order to account for variability in 10-fold cross-validation, the process was repeated 10 times with shuffled fold partitions, and the results were averaged across all runs. Additionally, inter-study validation (ISV) was conducted, in which a single study was held out each time, the model was trained on the remaining studies, and testing was performed on the samples of the single-held-out study. ISV allows for an assessment of classification performance across different studies.
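The evaluation logic can be sketched as follows. We assume here that deferred samples (those with score magnitude not exceeding c) are left out of the accuracy calculation and reported separately as a retained fraction; the function names, scores, and study labels are hypothetical.

```python
def balanced_accuracy(y_true, scores, c=0.0):
    """Balanced accuracy over non-deferred samples.

    y_true: 1 for healthy, 0 for non-healthy; scores: GMWI2-like values.
    Returns (balanced accuracy, fraction of samples retained).
    """
    kept = [(y, s) for y, s in zip(y_true, scores) if abs(s) > c]
    healthy = [s > 0 for y, s in kept if y == 1]      # correct if positive
    non_healthy = [s < 0 for y, s in kept if y == 0]  # correct if negative
    if not healthy or not non_healthy:
        raise ValueError("both classes must be represented after applying the cutoff")
    acc = 0.5 * (sum(healthy) / len(healthy) + sum(non_healthy) / len(non_healthy))
    return acc, len(kept) / len(y_true)

def inter_study_splits(study_ids):
    """Yield (held-out study, train indices, test indices), one study at a time."""
    for held_out in sorted(set(study_ids)):
        train = [i for i, s in enumerate(study_ids) if s != held_out]
        test = [i for i, s in enumerate(study_ids) if s == held_out]
        yield held_out, train, test

# Toy labels and scores (hypothetical).
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [2.1, 0.4, -0.3, 1.5, -1.8, 0.2, -0.6, -2.4]
acc_all, kept_all = balanced_accuracy(y_true, scores, c=0.0)        # (0.75, 1.0)
acc_strict, kept_strict = balanced_accuracy(y_true, scores, c=1.0)  # (1.0, 0.5)

study_ids = ["A", "A", "B", "B", "C", "C", "C", "C"]
splits = list(inter_study_splits(study_ids))
```

In this toy example, raising the cutoff from 0 to 1.0 increases balanced accuracy from 0.75 to 1.0 while halving the number of samples that receive a prediction. Note that a held-out study containing only one class has no defined balanced accuracy, which this sketch signals by raising an error.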

Reporting summary

Further information on research design is available in the Nature Portfolio Reporting Summary linked to this article.