Machine learning using clinical data at baseline predicts the efficacy of vedolizumab at week 22 in patients with ulcerative colitis

Predicting the response of patients with ulcerative colitis (UC) to a biologic such as vedolizumab (VDZ) before administration is an unmet need for optimizing individual patient treatment. We hypothesized that the machine-learning approach with daily clinical information can be a new, promising strategy for developing a drug-efficacy prediction tool. Random forest with grid search and cross-validation was employed in Cohort 1 to determine the contribution of clinical features at baseline (week 0) to steroid-free clinical remission (SFCR) with VDZ at week 22. Among 49 clinical features including sex, age, height, body weight, BMI, disease duration/phenotype, treatment history, clinical activity, endoscopic activity, and blood test items, the top eight features (partial Mayo score, MCH, BMI, BUN, concomitant use of AZA, lymphocyte fraction, height, and CRP) were selected for logistic regression to develop a prediction model for SFCR at week 22. In the validation using the external Cohort 2, the positive and negative predictive values of the prediction model were 54.5% and 92.3%, respectively. The prediction tool appeared useful for identifying patients with UC who would not achieve SFCR at week 22 during VDZ therapy. This study provides a proof-of-concept that machine learning using real-world data could permit personalized treatment for UC.

Ulcerative colitis (UC) is one of the major phenotypes of inflammatory bowel disease (IBD), and it is characterized by chronic colonic inflammation with periods of remission and relapse. Although the pathophysiology of UC remains unclear, more patients have been able to achieve remission with the improvement of therapeutic options and strategies, which has led to better long-term prognosis [1][2][3][4] . At present, various molecular targeted drugs, such as calcineurin inhibitors [cyclosporine A and tacrolimus (TAC)], anti-tumor necrosis factor alpha (TNFα) antibodies [adalimumab (ADA), golimumab, and infliximab (IFX)], an anti-α4β7 integrin antibody [vedolizumab (VDZ)], an anti-IL12/23p40 antibody [ustekinumab (UST)], and a Janus kinase (JAK) inhibitor [tofacitinib (TOF)], are used particularly for treating patients with steroid-dependent/refractory UC. Meanwhile, in most clinical settings, it is challenging for physicians to identify the most effective molecular targeted drug for individual patients. When a physician considers starting a molecular targeted medication, the patient has active disease that requires additional therapeutic intervention; that is, an appropriate medication must be selected without delay. At present, there is no guide for selecting the most suitable molecular targeted drug for the individual patient. This lack of a guide affects both patient outcomes and medical costs. Molecular targeted drugs are far more expensive than conventional medications such as 5-aminosalicylic acid (5-ASA), immunomodulators [e.g., azathioprine (AZA)], and steroids. The use of ineffective molecular targeted medications can represent a socioeconomic burden. Thus, predicting the efficacy of a molecular targeted medication before administration is crucial in this era of molecular targeted therapy. The real-world pooled outcome of VDZ demonstrated that the rates of clinical response and remission at week 14 were 51% and 31%, respectively 5 . 
Given these rates, other medications might be more effective than VDZ for some patients with UC, and the prediction of VDZ efficacy in advance could provide these patients with an opportunity to initially receive another therapy. Several studies investigated the predictors of response to VDZ in UC 6,7 . Among clinical factors at baseline, serum C-reactive protein (CRP) levels 8,9 , serum albumin concentrations 7 , the Mayo Clinic score 9 , previous exposure to anti-TNF agents 7,10 , disease duration 7 , and endoscopic activity 7 have been reported to be associated with the clinical efficacy of VDZ in patients with UC. These previous studies employed statistical methods such as univariate and multivariate analyses to search for the predictors. In this study, we hypothesized that a new approach using machine learning could illuminate predictive factors of VDZ efficacy for UC that have not been detected as statistically significant using the conventional statistical approaches. In the present study, we investigated clinical features at baseline (week 0) that affect steroid-free clinical remission (SFCR) during VDZ therapy at week 22 and developed a prediction tool. Random forest (RF) 11 is an ensemble learning algorithm generating decision trees based on the training data. RF can also estimate the relative importance score for each feature. That is, RF allows the analysis of many factors simultaneously and provides insights into the contribution of each factor to the eventual outcome (i.e., achievement vs. no achievement of SFCR at week 22). We employed this method for clinical data at week 0 for patients with UC who started VDZ treatment for the induction of remission (training cohort), and the extracted factors were used to develop a prediction tool. The predictive accuracy of the tool was evaluated with another data set of patients who received VDZ for UC (test cohort).
The merit of this study is that it attempts to establish a prediction model based on generally available clinical information collected in daily practice. This is crucial for applying a machine learning-based prediction tool to the clinical setting. This pioneering work provides a proof-of-concept that the machine-learning approach can be a new strategy for investigating predictors of treatment efficacy in patients with UC and developing a prediction tool.

Methods
Study subjects. We retrospectively collected clinical data at baseline (week 0) and examined the clinical activity of UC at week 22 in 34 patients who (1) started VDZ at Kyorin University Hospital between September 2019 and April 2020 for the induction of remission, (2) underwent blood testing at week 0, and (3) underwent examination at Kyorin University Hospital at week 22 (training cohort, Cohort 1). As an external cohort, we analyzed 35 patients with UC at Toho University Sakura Medical Center who (1) started VDZ between January 2019 and June 2020 for the induction of remission, (2) underwent blood testing at week 0, and (3) underwent examination at Toho University Sakura Medical Center at week 22 (Cohort 2). The diagnosis of UC was confirmed using the clinical practice guidelines for IBD of The Japanese Society of Gastroenterology 12 . VDZ treatment for the induction of remission was defined as VDZ started for active UC (Lichtiger index 13 ≥ 5).
The clinical features collected at baseline are listed in Table 1. The blood test was performed on the day of the first VDZ dose. Colonoscopy performed within 3 months before starting VDZ therapy was used to obtain the baseline endoscopic findings. Categorical data were replaced with dummy variables. Missing values were imputed with the mean for numerical data and the mode for categorical data. The data of patients in Cohort 2 were similarly collected from the Toho University Sakura Medical Center medical record system.
The standardized values of Cohort 1 were used for RF. RF was employed to develop a high-accuracy prediction model and to identify which features contributed to the prediction. RF is an ensemble technique using decision trees. In training, the RF algorithm creates multiple trees, and each tree is trained on a bootstrapped sample of the training data. Because the number of patients was limited in this study, RF was initialized using random values, and the training of RF was repeated 50 times. 
The contribution of each feature (49 clinical features, Table 1) to SFCR at week 22 was obtained by averaging its importance over the repeated training runs. When training the RF, the hyperparameters (number of trees and maximum depth of the tree) were automatically optimized via grid search and cross-validation. Grid search is a method for obtaining optimal hyperparameters in an algorithm. It performs an exhaustive search over a specified subset of the hyperparameter space of the training algorithm. The best hyperparameters are chosen according to the evaluation score on the validation data. Cross-validation is a resampling procedure for evaluating machine-learning models on a limited data sample. The general procedure is as follows: (1) split the dataset into k groups; (2) for each group, (i) select the group as a validation dataset, (ii) use the remaining groups ("k − 1" groups) as a training dataset, and (iii) fit a model on the training set and evaluate it on the validation set; and (3) calculate the average of the k evaluation scores. In RF, the final prediction is the mode of the predictions obtained from the individual decision trees. The feature importance is determined according to the extent to which decision tree nodes using each feature reduce impurity across all trees in the forest. Next, logistic regression was used to develop a prediction tool in this study. Logistic regression is a classification algorithm for assigning each observation to one of a discrete set of classes. As inputs, we used the eight clinical features at week 0 that had the highest contributions in the RF analysis to predict the achievement/no achievement of SFCR at week 22. Logistic regression outputs the probability that an observation vector belongs to a particular class using the logistic sigmoid function. The prediction accuracy of the model was assessed using the data of Cohort 2. We performed the machine learning in Python using the scikit-learn package. 
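The workflow described above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the code used in the study: the data are synthetic (generated with make_classification to mimic 34 training and 35 external patients with 49 features), the hyperparameter grid, the 3-fold stratified cross-validation, and the 0.5 probability threshold are hypothetical choices, the helper name mean_rf_importance is invented for this example, and only 5 repeats are run here for speed, whereas the text describes 50.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.preprocessing import StandardScaler


def mean_rf_importance(X, y, n_repeats=50, cv_folds=3):
    """Average impurity-based RF importances over repeated, differently seeded fits."""
    # Hypothetical grid over the two hyperparameters named in the text.
    param_grid = {"n_estimators": [50, 100], "max_depth": [2, 4]}
    importances = []
    for seed in range(n_repeats):
        cv = StratifiedKFold(n_splits=cv_folds, shuffle=True, random_state=seed)
        search = GridSearchCV(
            RandomForestClassifier(random_state=seed), param_grid, cv=cv
        )
        search.fit(X, y)  # refits the best setting on all training data
        importances.append(search.best_estimator_.feature_importances_)
    return np.mean(importances, axis=0)


# Synthetic stand-in for the two cohorts (NOT the study data).
X_all, y_all = make_classification(
    n_samples=69, n_features=49, n_informative=8, random_state=0
)
X_train, y_train = X_all[:34], y_all[:34]   # plays the role of Cohort 1
X_test, y_test = X_all[34:], y_all[34:]     # plays the role of Cohort 2

# Standardize with statistics from the training cohort only (an assumption).
scaler = StandardScaler().fit(X_train)
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)

# Rank features by mean importance and keep the top eight.
importance = mean_rf_importance(X_train_std, y_train, n_repeats=5)
top8 = np.argsort(importance)[::-1][:8]

# Logistic regression on the selected features, evaluated on the external cohort.
model = LogisticRegression(max_iter=1000).fit(X_train_std[:, top8], y_train)
proba = model.predict_proba(X_test_std[:, top8])[:, 1]
pred = (proba >= 0.5).astype(int)
print("Selected feature indices:", top8)
print("External-cohort accuracy:", (pred == y_test).mean())
```

Averaging the importances over several seeds is intended to reduce the dependence of the ranking on any single random initialization, which matters when the training cohort is small.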
Informed consent was obtained from subjects (also from a parent when a patient was younger than 18 years) prior to the study.
Colonoscopy was performed at baseline in 31 patients. Endoscopic disease activity was assessed using the Mayo endoscopic subscore (MES) and ulcerative colitis endoscopic index of severity (UCEIS) (Table 1).
In Cohort 2, the positive predictive value (PPV; achievement of SFCR) and negative predictive value (NPV; no achievement of SFCR) were 54.5% and 92.3%, respectively (Table 5).
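The reported predictive values can be recomputed from the Cohort 2 cross-tabulation of predicted versus observed outcomes. The sketch below assumes that the four counts given for that table (12, 10, 1, and 12) are the true positives, false positives, false negatives, and true negatives, respectively; this cell assignment is an assumption, chosen because it sums to the 35 patients of Cohort 2 and reproduces the reported percentages. The function name predictive_values is hypothetical.

```python
def predictive_values(tp, fp, fn, tn):
    """Return (PPV, NPV) from the four cells of a 2x2 prediction table."""
    ppv = tp / (tp + fp)  # fraction of predicted responders who achieved SFCR
    npv = tn / (tn + fn)  # fraction of predicted non-responders who did not
    return ppv, npv


# Assumed cell assignment for Cohort 2 (see the note above).
ppv, npv = predictive_values(tp=12, fp=10, fn=1, tn=12)
print(f"PPV = {ppv:.1%}, NPV = {npv:.1%}")  # PPV = 54.5%, NPV = 92.3%
```

Under this assignment, only 1 of the 13 patients predicted not to achieve SFCR actually achieved it, which is what the high NPV expresses.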

Discussion
In the present study, we analyzed 49 clinical features at week 0 using real-world data with the RF algorithm and determined the contribution of each clinical feature to the achievement of SFCR after 22 weeks of VDZ therapy. An advantage of RF is that we could investigate the contribution of these various clinical features in our cohorts despite the limited number of subjects. Generally, it is challenging to assess a large number of features in detail using statistical methodology, such as univariate and multivariate analyses, which require a large number of subjects. In addition, we believe that the "p-value" in statistical analyses needs to be interpreted carefully, although we acknowledge that statistical significance provides scientific insights. Some factors without statistical significance may still contribute to the outcome. Assessing the contribution of factors comprehensively with RF could be a promising approach for identifying predictors, particularly in a complex situation in which various factors can be involved, such as SFCR after VDZ treatment. Logistic regression was employed in this study to develop a prediction tool using the eight largest contributors: partial Mayo (pMayo) score, MCH (pg), BMI, BUN (mg/dL), concomitant use of AZA, lymphocyte fraction (%), height (cm), and CRP (mg/dL). Our model showed a high NPV (92.3%) for SFCR at week 22. This finding suggests that it would be better to consider other options if our model predicts that VDZ will be ineffective for an individual patient. In the logistic regression model, the sign of the coefficient of each factor indicates whether the factor is positively or negatively associated with the outcome. Our logistic regression model indicated that a lower pMayo score, higher MCH, lower BMI, higher BUN concentration, concomitant use of AZA, higher lymphocyte fraction, shorter height, and higher CRP concentration at week 0 were favorable for SFCR at week 22. 
We believe that interpreting the machine-learning results from medical and physiological viewpoints is crucial for considering the clinical significance of the model, and it could provide an opportunity to improve clinical practice.
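The reading of coefficient signs described above follows from the form of the logistic model. The expressions below are the standard logistic regression formulation written for eight input features; the symbols (x_j for the value of feature j as entered in the model, β_j for its coefficient, β_0 for the intercept, and σ for the logistic sigmoid) are notation introduced here for illustration and do not appear in the original text.

```latex
P(\mathrm{SFCR}=1 \mid x)
  = \sigma\!\left(\beta_0 + \sum_{j=1}^{8} \beta_j x_j\right),
\qquad
\sigma(z) = \frac{1}{1 + e^{-z}}

\log\frac{P}{1-P} = \beta_0 + \sum_{j=1}^{8} \beta_j x_j,
\qquad
\frac{\partial P}{\partial x_j} = \beta_j \, P \,(1 - P)
```

Because P(1 − P) is always positive, the predicted probability of SFCR increases with x_j when β_j > 0 and decreases when β_j < 0. Comparing the absolute values of coefficients across features is meaningful only when the features are on comparable scales.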
A lower pMayo score indicates less clinical disease activity 14 . Higher MCH levels suggest that bleeding attributable to UC, as well as iron, vitamin B12, or folic acid deficiency, is less severe. Since no patient had overt renal dysfunction in our cohorts, higher BUN levels are believed to reflect the intake of sources of nitrogen, i.e., patients' dietary intake, particularly of amino acids. Taken together, these factors imply that less disease activity and a better general and nutritional status are favorable for SFCR during VDZ therapy. In the present study, total cholesterol (TCho, mg/dL) was one of the nine strongest contributors in RF, and when we included this feature in the logistic regression model, its coefficient was positive. Because TCho levels are decreased in response to malnutrition induced by active inflammation, this finding also suggests that a better nutritional condition is positively related to VDZ efficacy. Barré et al. reviewed several reports on the predictors of VDZ treatment for UC and noted that severe disease activity at induction is a negative predictor 6 . Dulai et al. developed a tool to predict the response to VDZ that includes baseline moderate activity on endoscopy and albumin levels as positive predictors 7 . Our findings and interpretations of the pMayo score, MCH level, and BUN level appear compatible with these previous studies. Interestingly, lower BMI and shorter height were included as positive predictors in our prediction model. We speculate that these factors suggest a high VDZ concentration in the body because the dose was fixed at 300 mg/injection. In the GEMINI I study, a positive correlation was observed between VDZ serum concentrations and clinical response 15 . Samaan et al. reported that VDZ dose intensification was effective in patients with IBD with a suboptimal treatment response 16 . In a review by Barré et al., a low trough level of VDZ is cited as a negative predictor 6 . 
Taking these reports and our findings together, we speculate that adjusting the dose of VDZ depending on BMI could increase its efficacy. Meanwhile, caution may be needed when applying our prediction tool to patients with overt emaciation that far exceeds the range observed in the training dataset. It is noteworthy that the concomitant use of AZA was detected as a positive predictor, and the absolute value of its coefficient was the largest in our model; i.e., concomitant AZA use has a larger impact on SFCR at week 22 than the other features.

                     SFCR (+)   SFCR (−)
Prediction of (+)       12         10
Prediction of (−)        1         12

Scientific Reports | (2021) 11:16440 | https://doi.org/10.1038/s41598-021-96019-x www.nature.com/scientificreports/

Whereas the benefit of the combination of an immunomodulator and VDZ over VDZ monotherapy has not been established, our machine-learning approach identified a potentially beneficial effect of concomitant AZA use. We believe that the results for BMI/height and concomitant AZA use raise an important clinical question concerning the optimization of VDZ treatment for UC. Meanwhile, our finding that a higher lymphocyte fraction was related to SFCR during VDZ treatment suggests that VDZ responders could comprise a subgroup of UC with a specific pathophysiology. VDZ is a humanized monoclonal antibody directed toward α4β7 integrin. α4β7 integrin is expressed on the surface of lymphocytes, and it interacts with mucosal addressin cell adhesion molecule-1 (MAdCAM-1), which leads to the migration of lymphocytes to the intestine 17 . Based on this specific mechanism and our finding, we speculate that there could be a "lymphocyte-dominant" subgroup of UC, and that VDZ is particularly efficacious in such patients. The machine-learning approach would be useful for developing a prediction tool and obtaining clues for characterizing UC pathophysiology and subgrouping patients. Our model indicated that higher CRP levels were related to SFCR at week 22. 
This finding is incompatible with a previous report 6 , and it appears inconsistent with the favorability of a lower pMayo score. Among subjects with and without SFCR at week 22 in the training dataset, the mean ± standard error of the mean (SEM) of CRP levels was 1.566 ± 0.6187 mg/dL and 2.054 ± 0.6328 mg/dL, respectively (p = 0.0532, Mann-Whitney U test). However, four subjects who achieved SFCR had a high CRP level (8.37 mg/dL, 6.43 mg/dL, 6.34 mg/dL, and 2.25 mg/dL, respectively), whereas the level was 0.02-1.88 mg/dL in the other patients who achieved SFCR (the normal CRP level is ≤ 0.14 mg/dL). Given that CRP levels were not high overall in Cohort 1, the results of these four patients might have affected the decision of the machine-learning algorithm. We consider three future directions of the machine-learning approach for UC clinical data: (1) aiming for higher prediction accuracy, (2) developing prediction tools for various medications, and (3) searching for factors potentially involved in UC pathophysiology. Regarding (1), this study was limited by its small size. Larger training and test cohorts are needed to improve the prediction model and its accuracy. Additionally, it will be interesting to test other machine-learning methodologies, such as k-nearest neighbors (k-NN) and support vector machines, and determine whether those approaches can generate a better model. Point (2) 19 . Several cutting-edge studies are exploring the predictors of VDZ efficacy in patients with IBD. Ananthakrishnan et al. reported that the functional profile of the gut microbiome can be a predictor of VDZ efficacy at week 14 in patients with IBD 20 . Rath et al. 
analyzed peripheral blood and colonic biopsy samples for CD4+ T cell subpopulations, cytokine production, and mRNA and protein expression including the α4β7 integrin and MAdCAM-1 to investigate factors associated with VDZ efficacy in patients with IBD and revealed a significant difference in genetic signatures at baseline between subjects with and without clinical remission at week 14 21 . Verstockt et al. employed machine-learning methods and reported that the expression of four genes in colon tissue could be predictive of VDZ efficacy in patients with IBD 22 . Gazouli et al. analyzed the mucosal expression of immunological and inflammatory genes using a machine-learning algorithm and demonstrated that the response to VDZ in patients with UC is associated with mucosal gene expression profiles at baseline 23 . Although these findings are interesting, at present, they cannot be feasibly examined in a clinical setting. We believe it is advantageous to analyze common clinical features that can be obtained in a clinical setting to allow application of the predictors and prediction models in daily practice. Regarding point (3), adding experimental factors to the metadata for machine learning may provide opportunities to investigate novel factors associated with outcomes and understand the underlying pathological features of UC. Previous studies demonstrated that mucosal gene expression profiles are related to the treatment response of patients with UC 24,25 . Kim et al. reported that mucosal eosinophilia is a predictor of VDZ efficacy in patients with IBD 26 . These findings suggest the possibility that more factors that contribute to the clinical outcome have not been examined in daily practice. 
Analyzing various hypothetical predictors (e.g., cytokine levels, gene expression, histological characteristics) together with machine-learning approaches would provide insights into the contribution of each factor and facilitate the discovery of the characteristics of UC subgroups. In conclusion, using machine learning, we determined the contribution of clinical features at week 0 to the achievement of SFCR at week 22 in patients who received VDZ for UC and developed a prediction model. The predictive accuracy was evaluated in a separate cohort. The concept and findings of this study could promote personalized medicine in UC, and they could possibly be extrapolated to other medications and diseases.

Data availability
The data underlying this article will be shared by the corresponding author upon reasonable request.