Differentiation of intestinal tuberculosis and Crohn’s disease through an explainable machine learning method

Differentiation between Crohn’s disease and intestinal tuberculosis is difficult but crucial for medical decisions. This study aims to develop an effective framework to distinguish these two diseases through an explainable machine learning (ML) model. After feature selection, a total of nine variables are extracted, including intestinal surgery, abdominal, bloody stool, PPD, knot, ESAT-6, CFP-10, intestinal dilatation and comb sign. Besides, we compared the predictive performance of the ML methods with traditional statistical methods. This work also provides insights into the ML model’s outcome through the SHAP method for the first time. A cohort consisting of 200 patients’ data (CD = 160, ITB = 40) is used in training and validating models. Results illustrate that the XGBoost algorithm outperforms other classifiers in terms of area under the receiver operating characteristic curve (AUC), sensitivity, specificity, precision and Matthews correlation coefficient (MCC), yielding values of 0.891, 0.813, 0.969, 0.867 and 0.801 respectively. More importantly, the prediction outcomes of XGBoost can be effectively explained through the SHAP method. The proposed framework proves that the effectiveness of distinguishing CD from ITB through interpretable machine learning, which can obtain a global explanation but also an explanation for individual patients.


Materials and methods
Data source and feature selection. The study was approved by the Ethics Committee of the 2nd Xiangya Hospital, Central South University. All experiments were performed in accordance with relevant guidelines and regulations. Informed consent was obtained from the subjects for the participation in the study. Intestinal data were collected from the Second Xiangya Hospital of Central South University, 160 patients with CD and 40 patients with ITB were included in the study. All the patients were with active disease. All cases were combined with the clinical diagnosis and European diagnostic guidelines of CD and ITB 29 . The CD diagnosis is according to clinical, endoscopic and pathological characteristics, as well as the clinical response to Crohn's treatment. The diagnostic criteria of ITB include the following conditions: (1) there were acid-fast bacillus (AFB) or caseous granuloma in pathological diagnosis; (2) clinical recovery was complete with endoscopic mucosal healing and at least six months of antituberculosis therapy. After treatment, endoscopic follow-up was performed for 2-6 months. The institutional review committees of the participating centers have given their approval to this work.
We focus on the integration of basic parameters, including demographic data, clinical manifestations, biochemical indicators and endoscopic performance. The variables included are widely available in the diagnosis of Crohn and intestinal tuberculosis, which means that our diagnostic model has preferably general sense. The descriptive statistics of the dataset in paper are shown in Table 1.
SMOTE for imbalance data. Class imbalance is a challenging issue in data mining and machine learning 30 .
This study includes 160 CD patients and 40 ITB patients, the imbalance rate reaches 1:4. Traditional models tend to predict the sample as the category with the majority of samples. Therefore, an unbalanced dataset learning method is considered in this paper. www.nature.com/scientificreports/ Sampling technology, ensemble method and cost-sensitive learning are the most widely used approaches to resolving the issue of class imbalance 31 . Cost-sensitive learning allocates misclassification costs to different classes 32 . Generally, the cost of a few samples is high, while most samples are low. However, due to the accuracy of classification cost is difficult to obtain, results of the cost-sensitive learning method are usually unstable 33 . The methods based on sampling technology are still the mainstream of unbalanced data processing. The sampling approaches are utilized to alter the original class distribution through over-sampling the minority class or under-sampling the majority class instances. Nevertheless, resampling for majority classes may be a potentially useful training instance, while undersampling may not significantly improve the recognition for minority classes 34 . Rayhan proposed the Cusboost algorithm based on clustering sampling and compared it with several popular methods, including SMOTEboost and Rusboost. Each sampling method has its advantages Through the experiments on 19 public datasets 35 . Considering the small sample data used in this paper, we adopt the SMOTE algorithm for imbalanced data. The basic idea of SMOTE algorithm is to generate a few new samples by KNN technology and combine them with the original dataset 21 . This algorithm can be described as the following processes: Consider a training dataset D = x i , y i m 1 , x i ∈ R d and y i ∈ R l .
Step 1: Calculate each minority sample's k nearest neighbors of using the KNN algorithm. www.nature.com/scientificreports/ Step 2: To produce new sample points, N samples are randomly extracted from k nearest neighbors using random linear interpolation.
where, x i denotes one of the minority samples, x j denotes its neighbor sample and x new express the new samples.
Step 3: Combine the new samples with original samples to generate a new training dataset.
To achieve the balance of each epoch in the training process, the over-sampling rate is determined as follows: Here, N major and N minor denote the number of major class and minor class respectively.
XGBoost algorithm. XGBoost is one machine learning algorithm that shines in practice, which has been yielded state-of-art performance in many industries 36 . In this chapter, we briefly describe the Xgboost algorithm. Given a dataset with m features and n samples D = x i , y i , |D| = n , x i ∈ R m , y i ∈ R . Regularization objective of XGBoost algorithm is: where l is a differentiable convex loss function to measure the difference between the predicted value y i and the target value y i . And the � f = γ T + 1 2 � w� 2 denotes the penalty of model. In detail, T is the number of leaves in the tree and the leaf weights are denoted by w . This term penalizes the complexity of the model. The additional regularization term helps to smooth the final learnt weights to avoid over-fitting. In order to use traditional methods of optimization in Euclidean space, the model is trained in an additive way. Formally, the model greedily adds f t which can improve the most according to Eq. (2), then the objective can be expressed as follows.
where y (t) i indicates the prediction of the i-th at t-th iteration. Utilizing Taylor's second-order expansion, the objective of optimization can be written as follows.
Here g i = ∂ y (t−1) l y i , y (t−1) and h i = ∂ 2 y (t−1) l y i , y (t−1) are the first and second-order gradient statistics of the loss function, respectively.
A simplified objective at iteration t can be obtained through the traditional GBDT training process.
The optimal value for a fixed structure q(x) can be calculated by By expanding Taylor's second-order loss term of objective function, XGBoost retains more information. Simultaneously, the regularization term of branch node weight was utilized to improve the model's performance. In this paper, the XGBoost algorithm is used to train a model for distinguishing CD and ITB.
Explainability for machine learning. Understanding why a mathematical model makes a certain prediction can is of significance in many applications, especially in medical science 37 . A key factor in whether doctors will use machine learning model prediction for clinical decision-making is how they can know how the model makes a prediction. The definition of interpretability is to help people understand the reason and degree of machine learning prediction. Although the machine learning algorithm can model the complex nonlinear between variables, it can no longer provide the parameter estimation associated with predictors and result variables and has low transparency.
In this paper, we introduce the Shapley additional explanations (Shap) method to explain our machine learning model 38 . Shap approach is an additive interpretative model inspired by cooperative game theory. All the features are regarded as contributors, and it has been proved consistent with the importance of features in theory.
(1) www.nature.com/scientificreports/ To better describe the Shap approach, we first introduce a significant concept call Shapley value, which can allocate the cooperation benefit fairly by considering the contribution of each agent. Shapley value of agent i is equal to the average value of the expected contribution of which for a cooperation project. Suppose a cooperation program C = �Ag, v� , including several agents ( Ag = {1, 2, . . . , n}, n ≥ 2 ) and a characteristic equation v(C) = k(≥ i∈C x i ) of each agent's contribution to this project 39 .
Define the marginal contribution of agent i joining the organization S as: Then the Shapley value of agent i can be expressed as follow: Corresponding to the interpretation of machine learning prediction, 'game' refers to the prediction task of a single instance, 'revenue' denotes the predicted value of the instance minus the average predicted value of all instances, while 'player' refers to the instance's features, and they work together to obtain income.
Consider the contribution of each feature to the outcome. It is straightforward to obtain the effect in the linear model. The prediction of a data instance's linear model can be depicted as: where x denotes the instance. x j , j = 1, . . . , p states the feature of each instance. β j is the weight corresponding to x j . The contribution of j-th feature to prediction where E(β j X j ) denotes the average estimated effect value, that is, the contribution is the difference between characteristic effect and average effect.
Each feature's Shapley value is the weighted amount of its total expenditure (prediction) over all possible combinations of features.
where S is a subset of the features used in the model. x denotes the vector of the instance's feature to be interpreted, while the number of features is recorded as p . v x (S) expresses the prediction of features in set S , which is the marginalization of features not included in the set S.
Actually, Eq. (12) performs multiple integrals for each feature that is not include. It is worth noting that the Shapley value of the feature j is explained as follows: compared with the average prediction of the dataset, the contribution of the j-th feature to the prediction of this feature instance is ϕ j . Therefore, the Shapley value of the feature is not the difference of the predicted value after deleting the feature from the model, which can be regarded as the definition of fair expenditure.
Shap method can effectively estimate Shapley value according to the local agency model. This method can quickly implement a tree-based model through linking LIME and Shapley 20 . Shap defines the interpretation as: where g is the interpretation model, z ′ ∈ {0, 1} M is the alliance vector and M denotes the size of the largest alliance. The Shapley value of feature j is recorded as φ j ∈ R . In the alliance vectors, 1 means the corresponding feature exists, while 0 expresses it does not exist. As for interested instance x , all of the alliance vectors are equal to 1, which means that features exist. A can be simplified as: In theory, Shapley value is the only solution that satisfies efficiency, symmetry, virtuality and additivity. Shap method also meets the conditions, which can calculate shapely values. Specifically, the shap method describes the properties of the following three ideals: Missingness.
Missingness means that the attribute of missing feature is zero. x ′ j denote the alliance, a value of zero indicates that a feature in this instance is missing. Theoretically, a missing feature can have any Shapley value without www.nature.com/scientificreports/ compromising local accuracy because it is multiplied by x ′ j = 0 . This property forces the missing feature to obtain a Shapley value of zero. Additivity.
Additivity is also called local accuracy. It implies that the outcome of the model to be explained is equal to the sum of feature's attribute. Where φ 0 = E X f (x) , that is, the average of predicted values of the model. Consistency. Let For any two models f and f ′ , any z ′ ∈ {0, 1} M : Consistency means that if the marginal contribution of the feature increases or remains unchanged due to the change of the model, the Shapley value will increase or remain unchanged accordingly. In Eq. (17), function which is used to convert the alliance of features into effective data instances. That is, the corresponding value mapped to the instance x we want to interpret. Therefore, the global contribution of the variable can be calculated using the Shap method's local contribution. We average the Shapley absolute value of each feature in the dataset.

Simulation experiment
We initially obtained clinical data with 32 features from 200 patients (CD = 160, ITB = 40). This dataset is used to train the proposed method as well as the comparative approaches. Figure 1 demonstrates the overview of proposed framework using an explainable machine learning method. It mainly includes data preprocessing, model input feature selection, imbalance category processing, model establishment and interpretation.

Study design.
Before building the machine model, the significance test method is applied to obtain significant features related to the target. Compared with t-test, Mann-Whitney U test is appropriate for small samples, and does not require data correspond to normal distribution. Therefore, Mann-Whitney U test is used to select continuous variables related to CD and ITB identification. Chi-square test was used for categorical variables. Then, the average method is utilized to deal with missing values of continuous variables, while the counting variables with missing values are marked with other numbers.
Based on this, a total of nine features are chosen as input variables of the classification model. In details, the cohort consisting of 200 samples was stratified random sampling, splitting into 2 datasets-training set (60%) and testing set (40%). Next, upsampling the minor class instances in the training set through SMOTE algorithm. Among which, the training set was used to train a classification model for distinguishing CD from ITB, and the evaluation metrics of methods were reported on thetoprule testing set. Finally, the SHAP method was introduced to explain the output of the model. Evaluation criteria. Differentiating CD from ITB is a binary classification problem. We choose five different functions to evaluate the performance of models, including sensitivity, specificity, precision, area under the receiver operating characteristic curves (AUC) and Matthews correlation coefficient (MCC). Suppose that the instance ITB is a positive class and the instance CD is a negative class. According to the confusion matrix, these criteria can be described as follows: The value of AUC is between 0 and 1, which can directly evaluate the quality of the classifier. The larger AUC value denotes the better performance of a classifier. www.nature.com/scientificreports/ Besides, MCC is introduced in this paper, which is a balanced evaluation criterion applicable to the unbalanced category.
MCC essentially describes the correlation coefficient between the predicted results and the actual values.

Implementation details. Significant indicators were included in classification models through the Mann-
Whitney U test and Chi-square test, including Intestinal surgery, Abdominal, Bloody stool, PPD, Knot, ESAT-6, CFP-10, Intestinal dilatation and Comb sigh (see Table 1).
In this paper, the XGBoost algorithm is compared with two statistical methods and several machine learning. For instance, linear discriminant analysis (LDA), logistic regression (LOG), artificial neural network (ANN), support vector machine with different kernel functions, Bayesian regression (Bayes), random forest (RF) and gradient boosting decision tree (GBDT). Among which, as statistical methods, LDA and LOG are usually employed to solve a binary classification problem. ANN, SVM and Bayes are classic machine learning models based on different theories, which are commonly utilized as benchmark methods of machine learning. Besides, RF and GBDT are considered, two significant approaches in the development of the tree model.
The unbalanced rate of CD and ITB is used to over-sample the minor class instances (ITB), to achieve the balance of each epoch in the training process.
As for machine learning models, whose hyperparameters define the general characteristics, may directly affect its prediction accuracy. Therefore, it's extremely significant to optimize them. In particular, considering the complexity of ANN, this paper uses the single hidden layer network structure. The number of hidden layer neurons is important for the ANN model. Similarly, the optimal penalty coefficient C of SVM-linear, SVM-sigmoid and SVM-RBF is also obtained by cross-validation. For the three tree models (RF, GBDT and XGBoost), the most important parameters are the number of trees and the max feature. A larger number of trees would improve the performance of models, with more calculation cost. What's more, the prediction accuracy would no longer improve if the number of trees exceeds the special value. The max feature is determined by the features of the square root. The number of trees is optimized by cross-validation and the other parameters are gained by default values. In detail, parameter k for k-nearest neighbors of SMOTE algorithm is choose as 3. The regularization parameter , learning rate, number of trees and the max feature are ultimately determined as 0.01, 0.1, 100 and 5. Other hyperparameters are the default values. All of parameters are ultimately determined by fivefold crossvalidation. To make the result more reliable, each model is run 500 times to obtain an integrated average forecast. All analyses were carried out using Python, version 3.6.5 on a Dell server with 16 GB RAM. www.nature.com/scientificreports/ Results and analysis. The comparison between different classifiers illustrates that the XGBoost algorithm yields a promising performance with a mean AUC of 0.891. It also outperforms other classification models in terms of sensitivity, specificity, precision and MCC (see Table 2). Naive denotes the model without applying the SMOTE algorithm, which means does not use the class imbalance method. Results depict that applying SMOTE algorithm can improve the prediction performance. Among these methods, RF and GBDT are second only to the XGBoost algorithm. Notably, the performance of specificity is superior to sensitivity in most models. In this work, specificity indicates the probability to detect CD correctly while sensitivity expresses the extent to correctly detect ITB. This may mean that the identification of ITB is more challenging. To sum up, compared with traditional statistical methods, the machine learning algorithm performs better in the classification of CD and ITB. Figure 2 shows the SHAP summary plot for the XGBoost model. In our experiment, CD and ITB are coded as 0 and 1 respectively. Each point in the figure represents a sample, that is, the patient. The horizontal coordinate represents the Shapley corresponding to each feature of each sample. A positive value indicates that the prediction probability of ITB would be improved. The corresponding negative value denotes that the prediction probability is reduced, which means the probability of CD would be increased. The color in the plot represents the value of the feature, red denotes the feature with a large value, while blue represents a feature with a small value. For a binary variable, red color denotes 1 (positive) and blue color denotes 0 (negative). For instance, the Shapley value of the majority red sample in CFP-10 feature, indicating that positive CFP-10 would improve the probability of ITB patients. The majority blue sample of CFP-10 denotes that negative would improve the probability of CD patients. In term of color discrimination, Abdominal, CFP-10 and comb sign can be more effective to distinguish CD and ITB. Besides, the long right tails of abdominal in the summary plot mean that it is rare but may a high-magnitude risk factor.
It should be emphasized that a feature with low differentiation does not mean that it is not important, which reflects some characteristics that may not occur in these two diseases. The main advantage of the SHAP summary plot is that the effect of different variables on the prediction can be in a highly visual way.
We can use the SHAP method to gain a global explanation for our prediction (calculated by Eq. 22). The global explanations are obtained by calculating the SHAP explanations for all individual patients and then averaging them per feature. Figure 3 gives the bar chart plot for the nine significant variables contributing to the XGBoost model's forecasting for identifying CD and ITB. The greater the length, the greater the importance of www.nature.com/scientificreports/ the variable. It can be seen that features abdominal, CFP-10, Comb sign, PPD and intestinal surgery are more important predictors to distinguish CD from ITB. More importantly, the SHAP method can visualize the effect of input variables on each patient (see Fig. 4). The base value is the average of all predicted values of the model on the testing set. Features with red color and blue color indicate that increases (positive values) and decreases (negative value) the prediction compare with its baseline.
The first patient was diagnosed with CD, and the probability of CD predicted by the model is 0.97 (1-0.03). Intestinal surgery, CFP-10, comb sign and abdominal of this patient are positive. Among which, comb sign and abdominal decrease the prediction of ITB while the others increase. For him, the comb sign and abdominal play a more important role in decreasing the prediction. That's why the model diagnosed him as a CD.
The second patient is an ITB with a prediction probability of 0.85. As for this patient, abdominal (positive) and knot (negative) decrease the prediction of ITB. However, ESAT-6 (positive), intestinal surgery (positive), PPD (positive), comb sign (negative) and CFP-10 (positive) play a more important role in increasing the prediction probability.
The third patient was diagnosed with CD with a prediction probability of 0.81 (1-0.19). ESAT-6, Knot, CFP-10, and Abdominal of this patient are positive, while the Comb sign, PPD, intestinal surgery are negative. In details, PDD, Abdominal and Intestinal surgery increase the prediction of CD while the others decrease. It is worth noting that the effects of the same characteristics on different individuals are different. For example, CFP-10 (positive) in the first patient and third patient increase the prediction of ITB, while the contribution intensity is different (corresponding to the length of the color in the figure).

Discussion
In recent years, the morbidity of CD has increased significantly with the industrialization of many countries. Meanwhile, the incidence of intestinal tuberculosis (ITB) has been increasing. Distinguishing Crohn's disease (CD) from intestinal tuberculosis (ITB) has always been a challenge for clinicians in developing countries. PPD and the tuberculosis (TB) interferon-gamma (IFN-γ) release assay (TB-IGRA) are both associated with mycobacteria, and have a relatively high sensitivity and specificity for the diagnosis of ITB, especially TB-IGRA. However,  www.nature.com/scientificreports/ in addition to active TB infection, individuals with latent infection or past infection can also have a positive result of TB-IGRA and PPD. As we know that tuberculosis is still prevalent in developing countries, so positive result of TB-IGRA and PPD can be detected in a considerable proportion of the population in China, including some of the CD patients. Since IGRA and PPD have some difficulties in distinguishing active TB infection from latent or past TB infection, empirical anti-tubercular therapy (ATT) trial, and subsequent clinical and endoscopic response to ATT is still required in a significant proportion of patients to confirm the diagnosis. In our research, some of the CD patients with insufficient phenotype and positive PPD or TB-IGRA were established the final diagnosis of CD by empirical anti-TB therapy. However, ATT trial is associated with a delay in the diagnosis of CD, which may lead to poor prognosis and even serious side effects. Better method for improved differentiation is needed to reduce the need for ATT trial. Many researchers have been established several models to address this issue by using the clinical symptoms, laboratory tests, endoscopic findings and so on 3,[9][10][11][12]40 42 . It is undeniable that the larger the sample size, the more meaningful the results are. However, those studies didn't pay enough attention to the interpret of diagnostic models, especially the machine learning.
In terms of methods, the logistic regression model is one of the most popular in the field of healthcare. It's easy to understand the results through weights in the equation. Nevertheless, the logistic model is usually not good from the perspective of prediction ability. The comparison between different classifiers illustrates that the XGBoost algorithm yields a promising performance with a mean AUC of 0.891. It also outperforms other classification models in terms of sensitivity, specificity, precision and MCC. Among these methods, RF and GBDT are second only to the XGBoost algorithm. Notably, the performance of specificity is superior to sensitivity in most models. In this work, specificity indicates the probability to detect CD correctly while sensitivity expresses the extent to correctly detect ITB. This may mean that the identification of ITB is more challenging. Also, the explanation of weight is not intuitive when the correlation or complex relationship occurs in variables. Prior studies have noted the advantages of machine learning methods in fitting the complex relationship between predictors and targets. Thus, machine learning can perform better than traditional statical methods in out-of samples. However, their prediction results are difficult to be accepted by medical faculty due to the low transparency, although many researchers are devoted to helping people understand how machine learning models make such predictions. For instance, partial dependence plots, accumulated local effects, feature interaction, feature importance, global surrogate models and tree models. There are still several limitations among these methods: (1) only single feature can be explained effective; (2) independence condition must be satisfied; (3) only the global interpretation can be obtained. Besides, all of them have no solid theory that the contribution of features can't be calculated reasonably. In our study, the model's interpretability makes it possible to determine the contribution rate of a single variable in the prediction, which reflects the individualization of the prediction model. To sum up, compared with traditional statistical methods, the machine learning algorithm performs better in the discrimination of CD and ITB.
SHAP, a method to explain an individual prediction, is the only explanation approach with solid theory at present. This method was also used to predict GI bleed mortality in the intensive care unit, which yield a proposing performance 43 . Base on this, we propose an interpretable machine learning framework for distinguishing CD from ITB, which combing SMOTE algorithm and XGBoost method with SHAP method. Results prove that machine learning is superior to the traditional methods for differentiating CD and ITB. What's more, the SHAP method can effectively obtain a global explanation but also an explanation for individual patients. This work may improve medical workers' acceptance of prediction outcomes by machine learning without sacrificing accuracy.
Nonetheless, there are certain limitations to this paper. Because all the samples in our research were collected from a single center, the sample size, especially the ITB patients are somewhat small due to the limitation of clinical reality. Further research is required to establish the framework through more samples and include more variables. www.nature.com/scientificreports/ Publisher's note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.