An integrated pipeline for prediction of Clostridioides difficile infection

With the expansion of electronic health records(EHR)-linked genomic data comes the development of machine learning-enable models. There is a pressing need to develop robust pipelines to evaluate the performance of integrated models and minimize systemic bias. We developed a prediction model of symptomatic Clostridioides difficile infection(CDI) by integrating common EHR-based and genetic risk factors(rs2227306/IL8). Our pipeline includes (1) leveraging phenotyping algorithm to minimize temporal bias, (2) performing simulation studies to determine the predictive power in samples without genetic information, (3) propensity score matching to control for the confoundings, (4) selecting machine learning algorithms to capture complex feature interactions, (5) performing oversampling to address data imbalance, and (6) optimizing models and ensuring proper bias-variance trade-off. We evaluate the performance of prediction models of CDI when including common clinical risk factors and the benefit of incorporating genetic feature(s) into the models. We emphasize the importance of building a robust integrated pipeline to avoid systemic bias and thoroughly evaluating genetic features when integrated into the prediction models in the general population and subgroups.


Robust phenotyping algorithm
The phenotype algorithm used to identify CDI cases and controls from EHR data was developed by eMERGE entitled "Phenotype Algorithm Pseudo Code (August 16, 2012)" and collected at PhenoKB (https:// phekb.org/ pheno type/ clost ridium-diffi cile-colit is).This algorithm adopted the golden standard, which used laboratory test data to identify CDI cases.Based on clinical symptoms, we first identified adults (age ≥ 18) with CDI from the Geisinger EHR.Three or more consecutive liquid stools within a day may be tested for C. difficile based on recommended guidelines 13,14 .The polymerase chain reaction was used as the laboratory reference standard test 3 .Patients tested for C. difficile but with negative results or exposed to antibiotics with high or moderate risk (Appendix-1) served as controls.Nongenetic risk factors collected in this study included index age, people in healthcare settings, antibiotic treatment, underlying comorbidities such as inflammatory bowel disease (IBD), type 2 diabetic mellitus (T2DM), HIV, cancer, medications such as chemo, transplant, corticosteroid, anti-TNF, proton pump inhibitors(PPI).The observation window for each risk factor was empirically defined to avoid an uneven sampling of the disease trajectory, which might lead to temporal bias in feature selection 15 .Clinical risk factors and demographic information were extracted from the structured EHR based on the International Classification of Disease (ICD)-related codes and medication codes (Appendix 1).
Using data from January 1, 2009, through December 31, 2017, we identified 6,035 cases and 72,241 controls.Of these, 5911 cases and 69,086 controls had self-reported European ancestry.Overall, 22.4% (1156/14,148 for case/control) of European (EUR) patients enrolled in the MyCode project with genetic data available.

Data pre-processing
The entire cohort of participants with EUR ancestry (n = 5911/69,086) was first split based on the availability of genetic data.No missing data was observed in any of the included clinical variables.The MyCode samples were genotyped as previously reported 3 .Both SNPs (rs2227306 and rs4076) from IL-8 passed the quality control without missingness.The genetic variable is routinely treated in a dosage manner (0, 1, 2).Data extraction and pre-processing details (z-scored index age, binary codes for other variables) have previously been described 3 .
Either the χ 2 test, substituted by Fisher exact test for frequency per group ≤ 5% or the ANOVA test was performed to screen for the bivariate association.A heatmap was created to show the significance of this bivariate association based on the log-transformed p-value from each bivariate χ 2 test (Fig. 2).

Simulations to determine predictive power
A genotype simulation study was conducted to determine different modeling algorithms' predictive power and generalizability in the nonMyCode cohort.The genotypes of rs2227306 (IL-8) in this cohort of 4755 cases and 54,938 controls were created by a simulation strategy based on an assumption of a binomial probability distribution of each allele equaling to the prior parameter, MAF, estimated from the corresponding MyCode subgroups stratified by CDI, age, and sex (Table 2).
where R is a vector or a matrix of the summary statistics derived from the MyCode cohort, N and MAF represent the number of subjects and the corresponding MAF for each subcategory.This simulation strategy considers confounding factors such as age and sex, which may impact the association between the genetic variant and the outcome variable.

Individual SNP association testing
Genotype and phenotype association was conducted using Logistic Regression after controlling covariates such as age or sex in subgroups stratified by sex or age (binary).

Controlling confounding
Age and sex were identified as confounding factors for the association between genetic/nongenetic risk factors and CDI (see "Result").They were selected as covariates in a Logistic Regression model to create propensity scores(R MatchIt package).We chose "nearest neighbor matching without replacement" to create a more balanced case:control ratio at 1:5 or 1:10, as shown in eFigure 2.  www.nature.com/scientificreports/

Selecting machine learning (ML) algorithms
Model development and optimization were based on selecting the optimal sampling approach followed by a comparison of the eight classification algorithms, including Logistic Regression (glm), Gradient Boosted Classification (gbm), Extreme Gradient Boosting with the dropout regularization for regression trees (xgbDART), Bagging for tree (treebag), Neural Network [nnet, using 1-layer fully connected neural networks (shallow)], C5.0 (c5), LogitBoost (lb), and Support Vector Machine (SVM).We selected these models based on model/algorithmic diversity, performance on similar tasks, model interpretability, and ease of implementation 16 .Specifically, these set of algorithms are well-established with good performance for tabular data in a wide range of classification tasks 16 .Furthermore, Ensemble methods 17,18 (C5.0, xgbDART, Bagging, LogitBoost) that combine multiple weak learners to create a stronger model, can lead to improved predictive performance and better generalizability.Finally, by creating decision boundaries that are complex and not linear, Neural Network and SVM are capable of capturing non-linear relationships between the variables.

Evaluation metrics
We assessed the performance of the multivariable model in the prediction of CDI mainly by

Addressing data imbalance
The imbalance of a case:control dataset can lead to biased model training, where the ML models tends to favor the majority class and underestimate the minority class, resulting in high accuracy by always predicting the majority class.The oversampled data improve ML models for the prediction of minority class when there is a significant unbalanced case-control cohort, as shown in this study (~ 1:12).We performed oversampling (using Synthetic Minority Oversampling Technique, SMOTE, and random oversampling, ROSE) of the minority class during the model training to address the class imbalance and used F1 score and MCC to assess the model performance.
The SMOTE function oversampled the minority class (rare events) using bootstrapping (perc.over= 100) and k-nearest neighbor (k = 5) to synthetically create additional observations of that event and undersampling the majority class (perc.under= 200).For each case in the original dataset belonging to the minority class, perc.over/100 new examples of that class were created.ROSE applied smoothed bootstrapping to draw artificial samples from the feature space around the minority class without undersampling the majority class.We oversampled the minority class to reach a case:control ratio of 1:1 in a training dataset.

Model optimization
Both MyCode and nonMyCode samples were split into training and holdout data(testing) with 7:3 ratio at the beginning of this study.All the models were tuned in training data and tested in holdout data(testing data), as described in Fig. 1.The fivefold repeated CV has been applied to the corresponding training data from either MyCode or nonMyCode cohorts.The nonMyCode samples were considered as an additional dataset to determine the overfitting.However, the genotypes for nonMyCode samples were not available, which is very common for the non-biobank population.That is why we impute the genotype data for nonMyCode samples by simulation to determine the generalizability of the optimal models.A hyperparameter tuning grid was used to train the model with five-fold repeated cross-validation (CV) and ten repeats (R caret package).Model tuning was performed by an automatic grid search for each algorithm parameter randomly (eTable1 for the tuning parameters used in each final model).Finally, the testing set was used to calculate the model AUROC.
We also compared optimal models with the benchmark algorithm glm 19,20 , and extracted the feature importance of the 12 included variables.The ranks of each variable in feature importance, particularly the genetic variable, were compared across the algorithms.The DeLong test was used to compare AUROCs from two modeling algorithms and compare AUROCs of the best model with or without the genetic feature included 21,22 .For each model, the AUROC was calculated first, and the 95% confidence interval (95% CI) of AUROC was also computed.The covariance matrix Cov A, B , where A and B were the estimated AUROC for models A and B, respectively, of the AUROCs between the two models was calculated.The formula for the covariance between two sample www.nature.com/scientificreports/proportions, A, and B , was given by: , where n a andn b were the sample sizes for models A and B, respectively.The Z statistic ( A− B

SE( A− B)
), where SE( A − B) = Cov( A, B) , and p-value were computed under an assumption of normal distribution.This p-value represented the probability of observing a difference in AUROC as extreme as the one observed in the data, assuming the null hypothesis where there was no difference between the two models.

Patients demographics
The entire dataset was split based on the availability of the genetic data.Demographic and clinical information for MyCode and nonMyCode cohorts are listed in Table 1.The case:control ratio in MyCode(n = 1156/14,148, 1:11.55) was comparable to that in the nonMyCode cohort(n = 4755/54,938, 1:12.24).This ratio (~ 1:12) indicated approximately a ten-fold enrichment for controls as shown in the previously reported EHR-based large populational studies(more than 1:100).Several demographic features (e.g., sex, age) and known clinical risk factors showed significant differences between case and control groups in MyCode and nonMyCode cohort.Their bi-variate association among all predictive variables is illustrated in Fig. 2a and b.Antibiotics were the most significant risk factor for CDI.

Genotype-phenotype association
The Logistic Regression analysis showed a significant association between rs2227306 genotype and CDI only in young MyCode patients (β = 0.138, p = 0.048 vs. β = 0.062, p = 0.263) after controlling for sex.After controlling for age, this nominal association was only observed in females (β = 0.119, p = 0.034 vs β = 0.053, p = 0.427).The minor alleles from both SNPs with the higher expression level of CXCL8 and CXCL6 were associated with an increased risk for CDI.

Comparing oversampling methods to manage the case-control imbalance
SMOTE outperformed ROSE in seven out of eight examined algorithms(Fig.3).When using xgbDART and gbm models, SMOTE in the training dataset led to better F1 (0.264 and 0.272, respectively) than the ROSE (0.253 and 0.261, respectively) in the testing dataset.The process without resampling provided the worst F1 (0.037 and 0.056, respectively).SMOTE was chosen for the following analyses.Both MCC and F1 scores showed that SMOTE yielded a slightly higher value than ROSE with or without the genetic feature included according to xgbDART and gbm models in the testing dataset (eTable 2).

Discussion
We developed a prediction model of symptomatic CDI by integrating common risk factors extracted from electronic health records and genetic risk factors (rs2227306/IL8).Our modeling pipeline included steps to minimize systemic bias in the final models while adhering to best practices to improve model transparency.These steps included (1) applying robust and validated phenotyping, (2) selecting and optimizing a range of ML algorithms with a focus on attributes such as generalizability, interpretability, potential interactions, and bias-variance trade-off, (3) addressing data imbalance and performing extensive simulation studies to determine the predictive power in samples without genetic information, and finally using PSM to control for confounding factors.Overall, our results supported that decision tree-based Ensemble methods such as gbm and xgbDART demonstrated superior discriminative power than glm(logistic regression) 16 .Both gbm and xgbDART are based on the boosting algorithm 17,18 which is an ensemble technique that sequentially builds multiple weak learners, each focusing on the mistakes of its predecessor and assigning higher weights to misclassified instances to correct them in subsequent iterations.Ensemble methods often work better than individual ML algorithms due to the following reasons 17,18  The improvements gained by including common genetic variants in the optimal models were limited and age-or sex-dependent after PSM.This finding was consistent with the decreased genetic heritability observed   in late-onset compared to early-onset in multiple complex disease traits 23 , and rs2227306 (IL-8) was related to early-onset of disease 24 .

Importance of genetic risk factor across modeling algorithms
Feature importance was considered a measure of the individual contribution of the feature for a particular classifier, regardless of the shape or direction of the feature effect.The genetic feature was consistently ranked high in the optimal models, even higher in PSM subgroups.This finding corroborates the clinical value of genetic information in prediction models.The conclusion made from the MyCode sample could partially be extended to the larger nonMyCode sample.The potential interaction between the genetic feature and clinical risk factors beyond age and sex was not recognized and implemented in the genotype simulation, which may prevent the conclusion made from the MyCode sample could fully be extended to the larger nonMyCode sample, leading to more uncertainty of the discriminative power in the model with the simulated genetic feature included.The importance of the genetic feature was ranked lowest in the Logistic Regression-based model, suggesting Logistic Regression may underestimate the contribution of the genetic variant for the prediction of CDI, highlighting the importance of capturing multi-way interactions when assessing the value of common genetic variants with a small effect size in prediction models 25 .

Clinical relevance of the selected genetic variant from IL-8
Previous candidate gene approach of genetic predisposition of CDI has revealed a promoter polymorphism -rs4073(-251T>A) and its linkage disequilibrium SNP, rs2227306(+ 781T/C), from a pro-inflammatory cytokine, IL-8-can result in increased IL-8 production and predisposes subjects to CDI 6 , recurrent CDI 8 , or severity of CDI 7 .They are both eQTL for a gene cluster located at 4q13.3 (eFigure 1), encoding several members of the CXC chemokine family such as CXCL8 (aka IL-8), CXCL6, CXCL5, CXCL1, and more, which promote the recruitment of neutrophils to the site of infection.Minor alleles from both SNPs are associated with an increased expression level of CXCL8 (GTEx).Increased IL-8 protein levels and CXCL5 and IL8 message levels have been associated with prolonged disease 26 .Intrarectal administration of TcdA/TcdB in a mouse model increases the expression of inflammatory mediators such as CXCL1, the murine ortholog of human IL-8 27 .PheWAS analyses from UKBIOBANK (eFigure 1) confirmed that the top phenotypes associated with rs2227306 included some laboratory variables such as "Neutrophil count", "Neutrophil percentage", "White blood cell count", the latter of which contributed to a composite risk score developed in earlier studies in the prediction of CDI severity 28,29 (Table 3).Our study is the first to evaluate the integrated model by including the genetic and common clinical risk factors using various optimized ML algorithms to predict CDI.

In the context of other similar studies
Since the testing and treatment are not recommended in asymptomatic carriers of C. difficile 1 , we direct our focus on symptomatic CDI.The results from this study may eventually facilitate at-risk patient stratification for targeted treatment in patients more likely to benefit from emerging prevention or treatment options such as a vaccine, fidaxomicin, monoclonal antibodies, and fecal microbiota transplantation.Results may also support more granular substratification for therapeutic trials.
Controls were defined as patients without CDI, based on negative molecular laboratory results or exposed to similar risk factors, such as antibiotic use.Because of using the eMERGE phenotype algorithm with the inclusion/exclusion criteria to define controls, our case:control ratio was approximately 1:10, which was a tenfold enrichment of controls with increased risk for CDI, compared to more than 1:100 ratio summarized from other retrospective cohorts studies (Table 3).This enrichment would make the prediction more challenging.www.nature.com/scientificreports/ The results from a subgroup with PSM suggested that the contribution of the genetic variant in model prediction was minimum in elderly patients.This finding was consistent with the decreased genetic heritability observed in late-onset compared to early-onset in multiple complex disease traits 23 , and rs2227306 (IL-8) was related to early-onset of disease 24 .
As summarized in Table 3, the majority of the predictive models for CDI [30][31][32][33][34][35][36] are not based on large EHR or claims databases until recently [37][38][39][40][41][42][43] , whereas studies performed on recurrence [44][45][46][47] or severity 28,29,48,49 include very small cohorts.Further, the existing studies do not compare algorithms; instead, they focus on the amount of information extracted necessary for improved prediction.In general, the amount of information, such as the number of variables, correlates with model performance; however, including hundreds of variables can lead to models with lower interpretability and reduced generalizability to other healthcare systems.For example, some institution-specific features can rank in the top tier in feature importance; therefore, a healthcare-based model may have limited discriminative power in the prediction of individuals from other healthcare systems when geographic, social-economic, and clinical management environments differ significantly.
The generalizability of developed prediction models from a single healthcare system to others is debatable 37 .Since this study aims to utilize only common clinical risk factors readily available in most EHRs to build a prediction model, the conclusion made from this study could have better generalizability and may be easier to implement elsewhere.For these reasons, we propose that this integrated model is more transferable to EHR than complex models with manually curated variables and datasets.

Strength and limitation
The strength of this study lies in the following, (1) development of a prediction model of symptomatic CDI by integration of genetic and common clinical risk factors; (2) evaluation of several advanced ML algorithms to compare their performance; (3) identification of an association between the genetic variant and the outcome variable, which was confounded by age and sex; (4) determination of the value of the genetic feature in its contribution to the model performance, in general, and propensity score matched subgroups; and (5) identification of the selection bias in the cohort with genetic data available.
This study has some limitations, including (1) the accuracy of EHR data collection and recording processes that may vary by clinician, hospital, and over time to possibly prevent the generalizability of developed models to other healthcare systems; (2) our data came from a single healthcare system with a patient population that was predominantly European ancestry.The features selected from this homogenous population may not best represent or cover the complexity of feature space derived from a heterogenous population; (3) we only tested a common genetic variant with a high MAF.We expect the polygenic risk score developed from the consortium-based ) AND ("prediction"[Mesh] OR "machine learning").We focused on human subject studies.Any review articles and studies focusing on metagenomics or microbiome data were excluded.The PubMed database was searched, and studies available between January 1, 1990, and May 31, 2021, were included.We also checked the reference from each included study, and additional studies that were missed during the initial search were appended.A total of 23 original articles were included.
GWAS with individual effect size estimated from thousands of genetic variants would better represent the genetic liability to CDI and other complex diseases.
In conclusion, we showed that developing robust prediction models for CDI, and perhaps other complex conditions, requires a step-wise approach to ensure the highest level of transparency and lowest possible systemic bias.This study leveraged CDI as a disease model to demonstrate that although genetic information may improve predictions, the benefit of including genetic feature(s) in the prediction models should be thoroughly evaluated.

Figure 1 .Figure 2 .
Figure 1.A flowchart illustrated the sample size and the pipeline for the prediction model development.

Figure 3 .
Figure 3. Comparing the upsampling strategies in the MyCode testing dataset using F1 score as the metric for performance.The SMOTE function oversampled the minority class (a rare event) using bootstrapping (perc.over= 100) and k-nearest neighbor (k = 5) to synthetically create additional observations of that event and undersampling the majority class (perc.under= 200).For each case in the original dataset belonging to the minority class, perc.over/100new examples of that class will be created.The ROSE function oversampled the minority class without undersampling the majority class.Here we make the case:control ratio in the training dataset equaled to 1:1 for both oversampling strategies.F1 score, the weighted average of Precision and Recall, was selected to determine the performance of oversampling.Summary of the sample sizes for training with or without upsampling (SMOTE or ROSE) and testing dataset stratified by genetic data availability.ROSE upsampling cases (n = 782) to 9931 so that case:control ratio is 1:1 with controls (n = 9931) for the training dataset.SMOTE upsampling cases (n = 782) to 1564 so that case:control ratio is 1:1 with controls (n = 1564) for the training dataset.

Figure 4 .
Figure 4.Feature importance for the cohort with or without (simulated) genetic data was plotted for two selected models (gbm and xgbDART), which outperformed other models (glm and nnet).This study was based on 12 features, including one genetic risk factor, rs2227036, from IL8. Feature importance from glm and nnet was always plotted as a control to compare the rank of the features weighted by optimal modeling algorithms (gbm and xgbDART) in MyCode (top two rows) and nonMyCode samples (bottom two rows).The genetic feature was weighted the top tier in gbm and xgbDART but not in glm and nnet irrespective of PSM in the MyCode cohort.

Table 1 .
Demographic, potential risk factors, and genotype data for MyCode and nonMyCode participants with or without propensity score matching (PSM).Data were presented as the number of subjects with frequency in parentheses or Mean ± SD.All patients were subjected to a molecular test for the detection of C. difficile.A "*" represented statistics from the Chi-square test to determine whether there is a significant difference between the expected frequency and observed frequency in one or more categories or statistics from the ANOVA test to determine whether there is a significant difference between group means.IBD Inflammatory Bowel Disease, chemo Chemotherapy, PPI proton pump inhibitor medications, steroid Corticosteroid medications, TNF Anti-TNF medications, transplant Transplant medications, T2DM Type 2 Diabetes, MHC major histocompatibility complex, w with, w/o without.

Table 2 .
Two simulation strategies to create the genetic data of rs2227306(IL8) in the nonMyCode cohort.Significant values are in italics.This table summarizes the sample size and minor allele frequency(MAF) of each subgroup stratified by the outcome variable (CDI), age, and sex in both MyCode and nonMyCode cohorts.The genotype of rs2227306 was simulated based on an assumption of binomial distribution of each allele with frequency equaling the prior parameter (MAF) estimated from the corresponding MyCode samples.The genotype for each subject would be a combination of sampling from two binomial distributions with the same MAF.

Table 3 .
Summary of the reported clinical decision tools to predict Clostridioides difficile infection (CDI), severity, or recurrence.We conducted a comprehensive search on PubMed and Web of Science by combining two major themes of Clostridioides difficile and prediction.The search strings for prediction of Clostridioides difficile infection were: ("clostridioides difficile "[Mesh] OR "clostridium difficile"[Mesh]