Introduction

The integration of biobank and genetic data with electronic health records (EHR) has enabled the development of predictive models for healthcare applications. There is an urgent need for robust machine learning (ML) modeling pipelines to determine whether EHR-derived common clinical risk factors can predict a phenotype of interest and whether adding genetic factors improves model performance. Several factors must be considered during model development, including (1) selection bias in the biobank data; (2) case–control imbalance; (3) temporal bias in feature acquisition; (4) the impact of confounding factors; (5) optimal model selection to capture multi-way interactions; and (6) the predictive power and generalizability of the final models. Here we chose Clostridioides difficile infection (CDI) as a proof of concept given its complexity and clinical importance. The purpose of this study is to develop an integrated pipeline for predicting symptomatic CDI using common EHR-derived clinical and genetic risk factors. Our focus on symptomatic CDI is driven by the fact that testing and treatment for CDI are not recommended in asymptomatic individuals1.

CDI is considered the most common cause of healthcare-associated diarrhea and is listed as one of the top five urgent antimicrobial resistance threats by the Centers for Disease Control and Prevention (https://www.cdc.gov/drugresistance/biggest-threats.html). The existing literature on CDI prediction focuses on three outcomes—symptomatic infection, severity of the infection, and recurrence. The prediction models developed in this field vary by setting, patient recruitment, data source, study design, feature selection, and algorithms. EHR-based studies have become popular due to their improved predictability, specificity, and generalizability. Host genetic susceptibility to CDI and the epidemiology of C. difficile strains have been topics of investigation2,3. Intestinal inflammatory cytokines correlate more closely with disease severity than pathogen burden4. A similar mechanism underlies the inflammatory cytokine signature (plasma levels of IL-6, IL-8, and TNF-α) used to predict COVID-19 severity and survival5. A previous candidate gene approach revealed that genetic polymorphisms rs4073 (−251T>A) and rs2227306 (+781T/C) from a pro-inflammatory cytokine, IL-8, are associated with IL-8 production and predisposition to CDI6,7,8 with functional impact (eFigure 1). Genetic markers are not yet included in any established disease scoring system or clinical decision tool for risk stratification of CDI, due to unclear causality, small effect sizes, complex gene-by-environment (GxE) interactions, and limits on data availability/accessibility.

Materials and methods

The study was conducted and reported according to the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) guideline9. The Geisinger Institutional Review Board approved this study as “Non-human subject research” using de-identified information. All research was performed in accordance with relevant guidelines/regulations. Geisinger built and regularly updates the de-identified structured EHR research database linked to the MyCode Community Health Initiative biorepository10,11,12. The structured EHR and matching genetic data allowed us to conduct a retrospective study on primary CDI3. Informed consent was obtained from all subjects and/or their legal guardian(s) for the MyCode patients. The analysis pipeline, as well as CDI-specific study parameters, is illustrated in Fig. 1.

Figure 1

A flowchart illustrating the sample size and the pipeline for the prediction model development.

Robust phenotyping algorithm

The phenotype algorithm used to identify CDI cases and controls from EHR data was developed by eMERGE, entitled “Phenotype Algorithm Pseudo Code (August 16, 2012)”, and is available at PheKB (https://phekb.org/phenotype/clostridium-difficile-colitis). This algorithm adopted the gold standard of using laboratory test data to identify CDI cases. Based on clinical symptoms, we first identified adults (age ≥ 18) with CDI from the Geisinger EHR. Per recommended guidelines, patients with three or more consecutive liquid stools within a day may be tested for C. difficile13,14. Polymerase chain reaction was used as the laboratory reference standard test3. Patients tested for C. difficile with negative results, or exposed to antibiotics with high or moderate risk (Appendix-1), served as controls. Nongenetic risk factors collected in this study included index age; exposure to healthcare settings; antibiotic treatment; underlying comorbidities such as inflammatory bowel disease (IBD), type 2 diabetes mellitus (T2DM), HIV, and cancer; and medications such as chemotherapy, transplant-related drugs, corticosteroids, anti-TNF agents, and proton pump inhibitors (PPI). The observation window for each risk factor was empirically defined to avoid uneven sampling of the disease trajectory, which might lead to temporal bias in feature selection15. Clinical risk factors and demographic information were extracted from the structured EHR based on International Classification of Disease (ICD)-related codes and medication codes (Appendix 1).

Using data from January 1, 2009, through December 31, 2017, we identified 6,035 cases and 72,241 controls. Of these, 5,911 cases and 69,086 controls had self-reported European ancestry. Overall, 22.4% of European (EUR) patients (1,156 cases/14,148 controls) were enrolled in the MyCode project with genetic data available.

Data pre-processing

The entire cohort of participants with EUR ancestry (n = 5,911 cases/69,086 controls) was first split based on the availability of genetic data. No missing data were observed in any of the included clinical variables. The MyCode samples were genotyped as previously reported3. Both SNPs from IL-8 (rs2227306 and rs4073) passed quality control without missingness. The genetic variable was coded as allele dosage (0, 1, 2). Data extraction and pre-processing details (z-scored index age, binary codes for other variables) have previously been described3.

Bivariate associations were screened using either the χ2 test (substituted by Fisher's exact test when the frequency per group was ≤ 5%) or ANOVA. A heatmap was created to show the significance of these bivariate associations based on the log-transformed p-value from each bivariate χ2 test (Fig. 2).
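The 2×2 χ2 screen above can be sketched from first principles. The following Python example (a sketch; the study itself used standard R routines, and the table counts here are illustrative, not study data) computes the Pearson χ2 statistic for a 2×2 table and its 1-degree-of-freedom p-value via `math.erfc`:

```python
import math

def chi2_2x2(a, b, c, d):
    """Pearson chi-square test (no continuity correction) for the 2x2 table
    [[a, b], [c, d]], e.g. rows = exposed/unexposed, cols = case/control."""
    n = a + b + c + d
    # Closed-form chi-square statistic for a 2x2 table
    stat = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Survival function of chi-square with 1 df: P(X >= stat) = erfc(sqrt(stat/2))
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Illustrative table: 20/80 exposed among cases vs. 40/60 among controls
stat, p = chi2_2x2(20, 80, 40, 60)
print(round(stat, 3), round(p, 4))
```

For tables with small expected counts (frequency per group ≤ 5%, as noted above), Fisher's exact test would replace this approximation.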

Figure 2

Association among features and performance of models in prediction of CDI in MyCode and nonMyCode samples with (simulated) genotypes included. (a, b) Heatmaps to show the significant association between variables employed in the prediction model using the training dataset. Data extraction and pre-processing details (z-scored index age, binary codes for other variables) have previously been described. Association among variables (index age further dummy coded) was assessed using a bivariate χ2 test. (c, d) To examine the discrimination power of each modeling algorithm in the testing dataset, we estimated the AUROC using common clinical risk factors for CDI with or without rs2227306 as predictors. Here the genotypes of rs2227306 were simulated in the nonMyCode samples. (e) The summary of AUROCs of the optimal modeling algorithms (gbm and xgbDART) versus glm using simulated rs2227306 genotype. P values represent the result of the DeLong test to compare AUROC between models with or without (simulated) genetic data included, with or without PSM for index age and sex.

Simulations to determine predictive power

A genotype simulation study was conducted to determine the predictive power and generalizability of different modeling algorithms in the nonMyCode cohort. The genotypes of rs2227306 (IL-8) in this cohort of 4,755 cases and 54,938 controls were created by a simulation strategy that assumed a binomial probability distribution for each allele, with the probability parameter set to the MAF estimated from the corresponding MyCode subgroups stratified by CDI status, age, and sex (Table 2).

$${AlleleDosage}_{i \in R}=rbinom\left({N}_{i\in R},1, {MAF}_{i \in R}\right)+rbinom\left({N}_{i \in R},1, {MAF}_{i \in R}\right),$$

where R is a vector or matrix of the summary statistics derived from the MyCode cohort, and N and MAF represent the number of subjects and the corresponding MAF for each subcategory. This simulation strategy accounts for confounding factors such as age and sex, which may affect the association between the genetic variant and the outcome variable.
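The stratified rbinom simulation can be sketched in Python (offered as an illustration of the formula above; the stratum sizes and MAFs below are hypothetical, not the Table 2 estimates). Each subject's allele dosage is the sum of two Bernoulli draws with success probability equal to the stratum-specific MAF:

```python
import random

def simulate_dosage(n, maf, rng):
    """Simulate additive allele dosages (0/1/2) for n subjects in one stratum,
    mirroring rbinom(n, 1, maf) + rbinom(n, 1, maf) in R."""
    return [(rng.random() < maf) + (rng.random() < maf) for _ in range(n)]

rng = random.Random(42)
# Hypothetical strata: (n subjects, MAF estimated from the matching MyCode subgroup)
strata = {("case", "female", "young"): (500, 0.42),
          ("control", "male", "old"): (2000, 0.38)}
simulated = {key: simulate_dosage(n, maf, rng) for key, (n, maf) in strata.items()}

# Sanity check: the expected mean dosage in a stratum is 2 * MAF (here 0.76)
mean_dosage = sum(simulated[("control", "male", "old")]) / 2000
print(round(mean_dosage, 2))
```

Drawing each stratum with its own MAF is what lets the simulation preserve the age- and sex-dependent allele frequencies estimated from MyCode.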

Individual SNP association testing

Genotype–phenotype association was tested using logistic regression, controlling for covariates (age or sex) within subgroups stratified by sex or by age (binary).

Controlling confounding

Age and sex were identified as confounding factors for the association between genetic/nongenetic risk factors and CDI (see “Results”). They were included as covariates in a logistic regression model to create propensity scores (R MatchIt package). We chose nearest neighbor matching without replacement to create a more balanced case:control ratio of 1:5 or 1:10, as shown in eFigure 2.
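The matching step above can be sketched as greedy 1:k nearest-neighbor matching without replacement on precomputed propensity scores. This Python sketch mirrors the spirit of MatchIt's nearest-neighbor method (the ids and scores are made up for illustration; the actual propensity model in the study was a logistic regression on age and sex):

```python
def nn_match(case_scores, control_scores, ratio=5):
    """Greedy 1:ratio nearest-neighbor matching without replacement on
    propensity scores. case_scores / control_scores map id -> score."""
    available = dict(control_scores)
    matches = {}
    # Process cases in descending propensity-score order
    for cid, cs in sorted(case_scores.items(), key=lambda kv: -kv[1]):
        picked = []
        for _ in range(ratio):
            if not available:
                break
            # Closest remaining control by absolute score distance
            best = min(available, key=lambda k: abs(available[k] - cs))
            picked.append(best)
            del available[best]  # without replacement: each control used once
        matches[cid] = picked
    return matches

cases = {"A": 0.80, "B": 0.50}
controls = {"c1": 0.79, "c2": 0.81, "c3": 0.50, "c4": 0.52, "c5": 0.10, "c6": 0.90}
matched = nn_match(cases, controls, ratio=2)
print(matched)
```

With ratio = 5 or 10 and balanced covariates, this yields the 1:5 and 1:10 matched subgroups analyzed below.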

Selecting machine learning (ML) algorithms

Model development and optimization were based on selecting the optimal sampling approach, followed by a comparison of eight classification algorithms: Logistic Regression (glm), Gradient Boosted Classification (gbm), Extreme Gradient Boosting with dropout regularization for regression trees (xgbDART), Bagging for trees (treebag), Neural Network [nnet, a 1-layer (shallow) fully connected neural network], C5.0 (c5), LogitBoost (lb), and Support Vector Machine (SVM). We selected these models based on model/algorithmic diversity, performance on similar tasks, interpretability, and ease of implementation16. Specifically, this set of algorithms is well established, with good performance on tabular data across a wide range of classification tasks16. Furthermore, ensemble methods17,18 (C5.0, xgbDART, Bagging, LogitBoost), which combine multiple weak learners to create a stronger model, can improve predictive performance and generalizability. Finally, by creating complex, non-linear decision boundaries, Neural Networks and SVMs can capture non-linear relationships between variables.

Evaluation metrics

We assessed the performance of the multivariable models in the prediction of CDI mainly by sensitivity, specificity, Positive Predictive Value (PPV), Negative Predictive Value (NPV), precision, recall, F1 score, Matthews Correlation Coefficient (MCC), accuracy, and the area under the receiver operating characteristic curve (AUROC) for this classification problem.

$$\text{F1 Score} = 2 \times (\text{Recall} \times \text{Precision}) \, / \, (\text{Recall} + \text{Precision})$$
$$\text{MCC} = (\text{TP} \times \text{TN} - \text{FP} \times \text{FN}) \, / \, \sqrt{(\text{TP}+\text{FP})(\text{TP}+\text{FN})(\text{TN}+\text{FP})(\text{TN}+\text{FN})}$$
$$\text{Accuracy} = (\text{TP} + \text{TN}) \, / \, (\text{TP} + \text{TN} + \text{FP} + \text{FN})$$

where TP = true positives, TN = true negatives, FP = false positives, and FN = false negatives; Recall = TP/(TP + FN); Precision = TP/(TP + FP). Both the F1 score and MCC are calculated from the confusion matrix and are reliable metrics for evaluating binary classification models, particularly on imbalanced datasets. The F1 score combines precision and recall to provide a balanced evaluation of a model’s accuracy. MCC considers the overall balance among the four elements of the confusion matrix; its value ranges from −1 to 1, where 1 or −1 indicates complete agreement or disagreement between predicted and actual classes, respectively, and 0 indicates completely random guessing. The results of these metrics for the training and testing datasets across the eight ML algorithms are summarized in eTable 3. AUROC is a graphical representation of the model’s TP rate (sensitivity/recall) versus the FP rate (1 − specificity) at various probability thresholds. AUROC is insensitive to the class distribution and is often preferred over accuracy for imbalanced datasets.
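The metrics above follow directly from the confusion matrix, as in this minimal Python helper (the counts are illustrative, not study results):

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Compute the confusion-matrix-based metrics used in this study."""
    precision = tp / (tp + fp)          # = PPV
    recall = tp / (tp + fn)             # = sensitivity
    specificity = tn / (tn + fp)
    npv = tn / (tn + fn)
    f1 = 2 * precision * recall / (precision + recall)
    mcc = (tp * tn - fp * fn) / math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return {"precision": precision, "recall": recall, "specificity": specificity,
            "NPV": npv, "F1": f1, "MCC": mcc, "accuracy": accuracy}

# Illustrative confusion matrix from a hypothetical classifier
m = classification_metrics(tp=50, tn=100, fp=10, fn=40)
print({k: round(v, 3) for k, v in m.items()})
```

Note how accuracy (0.75 here) can look respectable while MCC (~0.50) and F1 (~0.67) reveal the cost of the 40 false negatives, which is why F1 and MCC were preferred for the imbalanced cohort.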

Addressing data imbalance

An imbalanced case:control dataset can bias model training: ML models tend to favor the majority class and underestimate the minority class, achieving high accuracy simply by always predicting the majority class. Oversampling improves ML models' prediction of the minority class when the case–control cohort is significantly imbalanced, as in this study (~ 1:12). We performed oversampling of the minority class during model training (using the Synthetic Minority Oversampling Technique, SMOTE, and Random Over-Sampling Examples, ROSE) to address the class imbalance, and used the F1 score and MCC to assess model performance.

The SMOTE function oversampled the minority class (rare events) using bootstrapping (perc.over = 100) and k-nearest neighbors (k = 5) to synthetically create additional observations of that class, while undersampling the majority class (perc.under = 200). For each minority-class case in the original dataset, perc.over/100 new examples of that class were created. ROSE applied smoothed bootstrapping to draw artificial samples from the feature space around the minority class without undersampling the majority class. We oversampled the minority class to reach a case:control ratio of 1:1 in the training dataset.
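The SMOTE interpolation step can be sketched in a few lines of Python (a simplified sketch of the idea, not the R implementation used in the study; k is reduced here to fit the toy data): each synthetic point lies on the segment between a minority sample and one of its k nearest minority neighbors.

```python
import math
import random

def smote(minority, k=5, perc_over=100, rng=None):
    """Minimal SMOTE sketch: for each minority point, create perc_over/100
    synthetic points interpolated toward one of its k nearest minority
    neighbors (Euclidean distance)."""
    rng = rng or random.Random(0)
    synthetic = []
    for i, x in enumerate(minority):
        # k nearest neighbors of x within the minority class (excluding x itself)
        neighbors = sorted(
            (p for j, p in enumerate(minority) if j != i),
            key=lambda p: math.dist(x, p))[:k]
        for _ in range(perc_over // 100):
            nb = rng.choice(neighbors)
            gap = rng.random()  # interpolation factor in [0, 1)
            synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, nb)))
    return synthetic

minority = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote(minority, k=2, perc_over=100)
print(len(new_points))  # perc_over/100 synthetic points per original case
```

Because synthetic points are interpolations rather than copies, SMOTE enlarges the minority region of feature space instead of merely reweighting existing cases, which is one reason it often outperforms plain duplication.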

Model optimization

Both MyCode and nonMyCode samples were split into training and holdout (testing) data at a 7:3 ratio at the beginning of this study. All models were tuned on the training data and evaluated on the holdout (testing) data, as described in Fig. 1. Five-fold repeated cross-validation (CV) with ten repeats was applied to the corresponding training data from either the MyCode or nonMyCode cohort (R caret package). The nonMyCode samples served as an additional dataset to assess overfitting. However, genotypes were not available for the nonMyCode samples, which is common for non-biobank populations; we therefore imputed the genotype data for the nonMyCode samples by simulation to determine the generalizability of the optimal models. Model tuning was performed by an automatic grid search over each algorithm's hyperparameters (eTable 1 lists the tuning parameters used in each final model). Finally, the testing set was used to calculate the model AUROC.
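The 7:3 stratified split and the five-fold CV fold assignment (which caret handles internally) can be sketched from first principles in Python; the labels below are a toy ~1:12 cohort, not study data:

```python
import random

def stratified_split(labels, test_frac=0.3, rng=None):
    """Split sample indices into train/test at (1-test_frac):test_frac,
    preserving the case:control ratio in both partitions."""
    rng = rng or random.Random(1)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * test_frac)
        test.extend(idx[:cut])
        train.extend(idx[cut:])
    return sorted(train), sorted(test)

def cv_folds(indices, k=5, rng=None):
    """Assign training indices to k CV folds (one repeat of repeated CV)."""
    rng = rng or random.Random(2)
    shuffled = indices[:]
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

labels = [1] * 10 + [0] * 120   # toy 1:12 case:control cohort
train, test = stratified_split(labels, test_frac=0.3)
folds = cv_folds(train, k=5)
print(len(train), len(test), [len(f) for f in folds])
```

Repeating `cv_folds` with ten different seeds reproduces the "five-fold, ten repeats" scheme; the holdout set is never touched until the final AUROC calculation.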

We also compared the optimal models with the benchmark algorithm glm19,20 and extracted the feature importance of the 12 included variables. The rank of each variable in feature importance, particularly the genetic variable, was compared across algorithms. The DeLong test was used to compare AUROCs from two modeling algorithms and to compare AUROCs of the best model with or without the genetic feature included21,22. For each model, the AUROC and its 95% confidence interval (95% CI) were computed. The covariance \(Cov\left(\widehat{A}, \widehat{B}\right)\) of the AUROCs between the two models, where \(\widehat{A}\) and \(\widehat{B}\) were the estimated AUROCs for models A and B, respectively, was calculated as \(Cov\left(\widehat{A}, \widehat{B}\right)\) = \(\frac{(1-\widehat{A})\widehat{A}}{{n}_{a}}+ \frac{(1-\widehat{B})\widehat{B}}{{n}_{b}}+ \frac{\widehat{B}\widehat{A}}{{n}_{b}{n}_{a}} (\frac{\widehat{A}}{{n}_{a}}+ \frac{\widehat{B}}{{n}_{b}}-1)\), where \({n}_{a}\) and \({n}_{b}\) were the sample sizes for models A and B, respectively. The Z statistic \(\frac{\widehat{A}-\widehat{B}}{SE(\widehat{A}-\widehat{B})}\), where \(SE\left(\widehat{A}-\widehat{B}\right)= \sqrt{Var(\widehat{A})+Var(\widehat{B})-2Cov(\widehat{A},\widehat{B})}\), and the corresponding p-value were computed under an assumption of normality. This p-value represents the probability of observing a difference in AUROC at least as extreme as the one observed, under the null hypothesis of no difference between the two models.
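A paired DeLong-style comparison can be sketched from first principles using placement values (this is a sketch of the general technique, not the exact R implementation used in the study; the scores are toy values, not study predictions):

```python
import math
from statistics import mean

def auc_and_placements(case_scores, control_scores):
    """AUROC via the Mann-Whitney kernel, plus per-sample placement values
    (the V10/V01 structural components of the DeLong estimator)."""
    psi = lambda c, d: 1.0 if c > d else (0.5 if c == d else 0.0)
    v10 = [mean(psi(c, d) for d in control_scores) for c in case_scores]
    v01 = [mean(psi(c, d) for c in case_scores) for d in control_scores]
    return mean(v10), v10, v01

def delong_test(cases_a, ctrls_a, cases_b, ctrls_b):
    """Paired DeLong test for two models scored on the same cases/controls."""
    auc_a, v10a, v01a = auc_and_placements(cases_a, ctrls_a)
    auc_b, v10b, v01b = auc_and_placements(cases_b, ctrls_b)
    m, n = len(v10a), len(v01a)
    cov = lambda x, y: sum((xi - mean(x)) * (yi - mean(y))
                           for xi, yi in zip(x, y)) / (len(x) - 1)
    # Var(auc_a - auc_b) from the DeLong covariance components
    var = (cov(v10a, v10a) + cov(v10b, v10b) - 2 * cov(v10a, v10b)) / m \
        + (cov(v01a, v01a) + cov(v01b, v01b) - 2 * cov(v01a, v01b)) / n
    z = (auc_a - auc_b) / math.sqrt(var)
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return auc_a, auc_b, z, p

# Toy example: model A separates cases/controls perfectly, model B does not
auc_a, auc_b, z, p = delong_test([0.9, 0.8], [0.1, 0.4], [0.3, 0.6], [0.5, 0.2])
print(round(auc_a, 2), round(auc_b, 2), round(z, 3), round(p, 3))
```

Because the same subjects are scored by both models, the cross-model covariance of the placement values is subtracted from the variance of the AUROC difference, which is what distinguishes the paired DeLong test from a naive comparison of two independent AUROCs.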

Results

Patient demographics

The entire dataset was split based on the availability of genetic data. Demographic and clinical information for the MyCode and nonMyCode cohorts is listed in Table 1. The case:control ratio in MyCode (n = 1,156/14,148, 1:11.55) was comparable to that in the nonMyCode cohort (n = 4,755/54,938, 1:12.24). This ratio (~ 1:12) represents an approximately ten-fold enrichment of controls compared with previously reported large EHR-based population studies (more than 1:100). Several demographic features (e.g., sex, age) and known clinical risk factors showed significant differences between case and control groups in both the MyCode and nonMyCode cohorts. Bivariate associations among all predictive variables are illustrated in Fig. 2a and b. Antibiotics were the most significant risk factor for CDI.

Table 1 Demographic, potential risk factors, and genotype data for MyCode and nonMyCode participants with or without propensity score matching (PSM).

Genotype–phenotype association

Logistic regression analysis showed a significant association between rs2227306 genotype and CDI only in young MyCode patients (β = 0.138, p = 0.048 vs. β = 0.062, p = 0.263) after controlling for sex. After controlling for age, this nominal association was observed only in females (β = 0.119, p = 0.034 vs. β = 0.053, p = 0.427). The minor alleles of both SNPs, associated with higher expression levels of CXCL8 and CXCL6, were associated with an increased risk for CDI.

Comparing oversampling methods to manage the case–control imbalance

SMOTE outperformed ROSE in seven of the eight examined algorithms (Fig. 3). For the xgbDART and gbm models, SMOTE applied to the training dataset led to better F1 scores in the testing dataset (0.264 and 0.272, respectively) than ROSE (0.253 and 0.261, respectively). Training without resampling produced the worst F1 scores (0.037 and 0.056, respectively). SMOTE was therefore chosen for the subsequent analyses. Both MCC and F1 scores showed that SMOTE yielded slightly higher values than ROSE, with or without the genetic feature included, for the xgbDART and gbm models in the testing dataset (eTable 2).

Figure 3

Comparison of the upsampling strategies in the MyCode testing dataset using the F1 score as the performance metric. The SMOTE function oversampled the minority class (a rare event) using bootstrapping (perc.over = 100) and k-nearest neighbors (k = 5) to synthetically create additional observations of that class while undersampling the majority class (perc.under = 200). For each minority-class case in the original dataset, perc.over/100 new examples of that class were created. The ROSE function oversampled the minority class without undersampling the majority class. Here the case:control ratio in the training dataset was set to 1:1 for both oversampling strategies. The F1 score, the weighted average of precision and recall, was selected to evaluate oversampling performance. Sample sizes for the training datasets with or without upsampling (SMOTE or ROSE) and for the testing dataset, stratified by genetic data availability, are summarized. ROSE upsampled cases (n = 782) to 9,931 to reach a 1:1 case:control ratio with controls (n = 9,931) in the training dataset. SMOTE upsampled cases (n = 782) to 1,564 to reach a 1:1 case:control ratio with controls (n = 1,564) in the training dataset.

Predicting CDI in MyCode patients with or without propensity score matching (PSM)

The best tuning parameters for each model, selected based on the largest ROC value in the training dataset, are summarized in eTable 1. Among the eight algorithms examined, using the 11 clinical risk factors with rs2227306 in conjunction with gbm and xgbDART led to superior results (with the genetic feature, AUROCgbm = 0.72 [0.694–0.746] versus AUROCglm = 0.684 [0.655–0.713], p = 1.92e−07) (Fig. 2c). There was no significant difference between gbm and xgbDART (AUROCxgbDART = 0.715 [0.689–0.742], p = 0.247). The genetic feature was consistently ranked higher relative to known risk factors by gbm and xgbDART than by glm, suggesting that non-parametric algorithms can better capture nonlinear interactions (see the radar plots in Fig. 4). Compared to the base model, the integrated model provided the best discriminative power in the testing dataset, particularly for gbm (AUROC = 0.710 [0.683–0.737] vs. 0.72 [0.694–0.746], p = 0.006). xgbDART showed a similar trend, performing better with the genetic feature, but did not reach statistical significance (AUROC = 0.710 [0.683–0.736] vs. 0.715 [0.689–0.742], p = 0.304). Both gbm and xgbDART provided more balanced sensitivity (0.508 and 0.518 versus 0.481) and specificity (0.792 and 0.796 versus 0.777), and better PPV (0.178 and 0.183 versus 0.161) and NPV (0.948 and 0.949 versus 0.944), compared to glm. No significant difference in age or sex was observed after PSM (Table 1, eFigure 2). After matching (e.g., at a 1:10 ratio), the best-performing algorithms remained gbm (AUROCgbm = 0.711 [0.681–0.740]) and xgbDART (AUROCxgbDART = 0.701 [0.671–0.731]) with rs2227306 (Fig. 2e). The former showed no statistically significant improvement over the model without rs2227306 (AUROCglm = 0.703 [0.675–0.732]). A similar trend was observed in the matched cohort with a 1:5 ratio (Fig. 2e). Both gbm and xgbDART ranked the genetic feature highest in feature importance (Fig. 4, top two rows), while glm did not; glm also ranked index age and sex lower.

Figure 4

Feature importance for the cohorts with or without (simulated) genetic data was plotted for the two selected models (gbm and xgbDART), which outperformed the other models (glm and nnet). This study was based on 12 features, including one genetic risk factor, rs2227306, from IL-8. Feature importance from glm and nnet was always plotted as a control to compare the ranks of features weighted by the optimal modeling algorithms (gbm and xgbDART) in MyCode (top two rows) and nonMyCode samples (bottom two rows). The genetic feature was ranked in the top tier by gbm and xgbDART but not by glm and nnet, irrespective of PSM in the MyCode cohort.

Predictive power in nonMyCode patients

The simulated genotype in the nonMyCode cohort recapitulated the nominal association between rs2227306 and CDI (Table 2). Consistent with the results from MyCode patients, with the simulated genetic feature included, gbm (AUROCgbm = 0.820 [0.809–0.831]) and xgbDART (AUROCxgbDART = 0.819 [0.808–0.831]) outperformed the other modeling algorithms (e.g., AUROCglm = 0.751 [0.737–0.765], p = 1.68e−57) (Fig. 2d). Again, rs2227306 was ranked higher by gbm and xgbDART than by glm (Fig. 4, bottom two rows). Performance metrics (sensitivity, specificity, PPV, NPV, precision, recall, F1 score, and accuracy) for the prediction of CDI in the training and testing datasets across the eight ML algorithms using SMOTE as the oversampling method are summarized in eTable 3.

Table 2 Two simulation strategies to create the genetic data of rs2227306(IL8) in the nonMyCode cohort.

Discussion

We developed a prediction model of symptomatic CDI by integrating common risk factors extracted from electronic health records with a genetic risk factor (rs2227306/IL-8). Our modeling pipeline included steps to minimize systemic bias in the final models while adhering to best practices to improve model transparency: (1) applying robust and validated phenotyping; (2) selecting and optimizing a range of ML algorithms with a focus on attributes such as generalizability, interpretability, potential interactions, and the bias-variance trade-off; (3) addressing data imbalance; (4) performing extensive simulation studies to determine predictive power in samples without genetic information; and (5) using PSM to control for confounding factors.

Overall, our results support that decision-tree-based ensemble methods such as gbm and xgbDART demonstrate superior discriminative power compared with glm (logistic regression)16. Both gbm and xgbDART are based on the boosting algorithm17,18, an ensemble technique that sequentially builds multiple weak learners, each focusing on the mistakes of its predecessor and assigning higher weights to misclassified instances to correct them in subsequent iterations. Ensemble methods often work better than individual ML algorithms for the following reasons17,18: (1) handling imbalanced data by giving more weight to the minority class and balancing the predictions, as shown in this study; (2) reducing bias and variance by combining multiple individual models (weak learners) into a single, more robust model (a strong learner); (3) complementary learning, leveraging the complementary strengths of individual algorithms with less overfitting and improved generalization; (4) tackling complex relationships that cannot be effectively captured by a single model; and (5) robustness to noisy data, because averaging or voting mechanisms are less affected by noisy data points and outliers.

The improvements gained by including common genetic variants in the optimal models were limited and age- or sex-dependent after PSM. This finding is consistent with the decreased genetic heritability observed in late-onset compared with early-onset disease across multiple complex traits23; rs2227306 (IL-8) has been related to early disease onset24.

Importance of genetic risk factor across modeling algorithms

Feature importance was considered a measure of the individual contribution of a feature for a particular classifier, regardless of the shape or direction of the feature effect. The genetic feature was consistently ranked high in the optimal models, and even higher in PSM subgroups. This finding corroborates the clinical value of genetic information in prediction models. The conclusions drawn from the MyCode sample could partially be extended to the larger nonMyCode sample. However, potential interactions between the genetic feature and clinical risk factors beyond age and sex were not modeled in the genotype simulation, which may prevent conclusions from the MyCode sample from fully extending to the larger nonMyCode sample and adds uncertainty to the discriminative power of models that include the simulated genetic feature. The genetic feature was ranked lowest by the logistic regression-based model, suggesting that logistic regression may underestimate the contribution of the genetic variant to the prediction of CDI and highlighting the importance of capturing multi-way interactions when assessing the value of common genetic variants with small effect sizes in prediction models25.

Clinical relevance of the selected genetic variant from IL-8

A previous candidate gene approach to the genetic predisposition to CDI revealed that a promoter polymorphism, rs4073 (−251T>A), and a SNP in linkage disequilibrium with it, rs2227306 (+781T/C), from the pro-inflammatory cytokine IL-8, can result in increased IL-8 production and predispose subjects to CDI6, recurrent CDI8, or severe CDI7. Both are eQTLs for a gene cluster located at 4q13.3 (eFigure 1) encoding several members of the CXC chemokine family, such as CXCL8 (aka IL-8), CXCL6, CXCL5, and CXCL1, which promote the recruitment of neutrophils to the site of infection.

Minor alleles of both SNPs are associated with an increased expression level of CXCL8 (GTEx). Increased IL-8 protein levels and CXCL5 and IL8 mRNA levels have been associated with prolonged disease26. Intrarectal administration of TcdA/TcdB in a mouse model increases the expression of inflammatory mediators such as CXCL1, the murine ortholog of human IL-827. PheWAS analyses from UK Biobank (eFigure 1) confirmed that the top phenotypes associated with rs2227306 included laboratory variables such as “Neutrophil count”, “Neutrophil percentage”, and “White blood cell count”, the last of which contributed to a composite risk score developed in earlier studies for the prediction of CDI severity28,29 (Table 3). Our study is the first to evaluate an integrated model including genetic and common clinical risk factors, using various optimized ML algorithms, to predict CDI.

Table 3 Summary of the reported clinical decision tools to predict Clostridioides difficile infection (CDI), severity, or recurrence.

In the context of other similar studies

Since testing and treatment are not recommended in asymptomatic carriers of C. difficile1, we focused on symptomatic CDI. The results from this study may eventually facilitate at-risk patient stratification for targeted treatment in patients more likely to benefit from emerging prevention or treatment options such as vaccines, fidaxomicin, monoclonal antibodies, and fecal microbiota transplantation. Results may also support more granular substratification for therapeutic trials.

Controls were defined as patients without CDI, based on negative molecular laboratory results or exposure to similar risk factors such as antibiotic use. Because the eMERGE phenotype algorithm's inclusion/exclusion criteria were used to define controls, our case:control ratio was approximately 1:10, a tenfold enrichment of controls at increased risk for CDI compared with the ratios of more than 1:100 summarized from other retrospective cohort studies (Table 3). This enrichment makes the prediction task more challenging.

The results from a subgroup with PSM suggested that the contribution of the genetic variant to model prediction was minimal in elderly patients, consistent with the age-dependent heritability discussed above.

As summarized in Table 3, the majority of predictive models for CDI30,31,32,33,34,35,36 were not based on large EHR or claims databases until recently37,38,39,40,41,42,43, whereas studies on recurrence44,45,46,47 or severity28,29,48,49 include very small cohorts. Further, existing studies do not compare algorithms; instead, they focus on the amount of information needed for improved prediction. In general, the amount of information, such as the number of variables, correlates with model performance; however, including hundreds of variables can lead to models with lower interpretability and reduced generalizability to other healthcare systems. For example, some institution-specific features can rank in the top tier of feature importance; a healthcare-system-specific model may therefore have limited discriminative power for individuals from other healthcare systems when geographic, socioeconomic, and clinical management environments differ significantly.

The generalizability of developed prediction models from a single healthcare system to others is debatable37. Since this study aims to utilize only common clinical risk factors readily available in most EHRs to build a prediction model, the conclusion made from this study could have better generalizability and may be easier to implement elsewhere. For these reasons, we propose that this integrated model is more transferable to EHR than complex models with manually curated variables and datasets.

Strength and limitation

The strengths of this study lie in the following: (1) development of a prediction model of symptomatic CDI by integration of genetic and common clinical risk factors; (2) evaluation of several advanced ML algorithms to compare their performance; (3) identification of an association between the genetic variant and the outcome variable, which was confounded by age and sex; (4) determination of the value of the genetic feature in its contribution to model performance, both overall and in propensity score matched subgroups; and (5) identification of the selection bias in the cohort with genetic data available.

This study has some limitations: (1) the accuracy of EHR data collection and recording processes may vary by clinician, hospital, and over time, possibly limiting the generalizability of the developed models to other healthcare systems; (2) our data came from a single healthcare system with a patient population of predominantly European ancestry, and the features selected from this homogeneous population may not fully represent the complexity of the feature space of a heterogeneous population; and (3) we tested only a single common genetic variant with a high MAF. We expect that a polygenic risk score developed from consortium-based GWAS, with individual effect sizes estimated from thousands of genetic variants, would better represent the genetic liability to CDI and other complex diseases.

In conclusion, we showed that developing robust prediction models for CDI, and perhaps other complex conditions, requires a step-wise approach to ensure the highest level of transparency and lowest possible systemic bias. This study leveraged CDI as a disease model to demonstrate that although genetic information may improve predictions, the benefit of including genetic feature(s) in the prediction models should be thoroughly evaluated.