Introduction

Schistosomiasis japonica is an infectious parasitic disease with serious consequences, widely distributed in tropical and subtropical regions of Asia, Africa and other continents1. According to WHO reports, schistosomiasis is still spreading and prevalent in 52 countries, affecting the health and quality of life of millions of people2. The cercariae of Schistosoma japonicum penetrate human skin and reach the liver through the blood circulation, and the parasites then lay eggs in large numbers. Inflammatory granulomas form around the schistosome eggs, and liver fibrosis gradually develops around these foci3. If patients are not treated in time, they are more likely to progress to cirrhosis, particularly when other liver diseases are also present. Early clinical diagnosis and treatment can increase the extent to which liver fibrosis in schistosomiasis improves. To improve quality of life and effectively reduce the risks of liver cirrhosis, peritoneal effusion, and liver cancer, early prediction and diagnosis of liver fibrosis has become an important problem in the diagnosis and treatment of liver fibrosis in schistosomiasis. At present, serological biomarkers and transient elastography are widely accepted clinically as the main basis for diagnosing early liver fibrosis4. However, both share the same problem: it is difficult to stage liver fibrosis accurately. In addition, transient elastography measurements are easily disturbed by sampling errors, differences in instrument use, and other factors, which limits their clinical use5.

Machine learning is an artificial intelligence method used to process large amounts of complex, multi-type data, and it has achieved breakthroughs in complex medical problems6. If the strengths of machine learning methods in describing complex data structures can be exploited, the degree of development of liver fibrosis in schistosomiasis may be predicted and diagnosed more accurately. This could provide valuable early evidence for clinical treatment, thereby improving the quality of life and prognosis of patients.

The purpose of this study was to identify the factors influencing liver fibrosis in schistosomiasis and, based on routine blood examination data, to establish a machine learning model for the early prediction of liver fibrosis in schistosomiasis.

Results

Baseline information

This study included 1049 patients, and the baseline table of the total population is shown in Table 1. The median age was 62.0 years (range 51.0–71.0). In the whole population, 281 patients (26.79%) had significant liver fibrosis, and 768 patients (73.21%) had no significant liver fibrosis.

Table 1 Baseline.

Variable screening

A total of 10 key factors were selected by the LassoCV method: ‘RDW-SD’, ‘MCHC’, ‘MCV’, ‘HCT’, ‘Red blood cells’, ‘Eosinophils’, ‘Monocytes’, ‘Lymphocytes’, ‘Neutrophils’, ‘Age’.

Multi-algorithm model comparison

Six machine learning algorithms were used for classification. Among them, LightGBM performed best, with AUCs of 1 and 0.818 in the training set and validation set, respectively (Fig. 1A,B). Its cutoff value, accuracy, sensitivity, specificity, positive predictive value, negative predictive value, F1 score, and Kappa value were 0.876, 0.807, 0.709, 0.842, 0.842, 0.803, 0.769, and 0.394, respectively. The evaluation results of the other machine learning algorithms are shown in Table 2 and Supplementary Table 1. The forest plot in Supplementary Fig. 1 shows the ROC results of each model, with error bars indicating the SD of the mean. The clinical decision curve in Supplementary Fig. 2 shows that LightGBM performs well and is more stable.
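
The evaluation metrics listed above can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch of those definitions; the function name and the counts are hypothetical and are not taken from the study data.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion-matrix counts."""
    n = tp + fp + tn + fn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    ppv = tp / (tp + fp)           # positive predictive value
    npv = tn / (tn + fn)           # negative predictive value
    f1 = 2 * ppv * sensitivity / (ppv + sensitivity)
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_chance = ((tp + fp) / n) * ((tp + fn) / n) + ((tn + fn) / n) * ((tn + fp) / n)
    kappa = (accuracy - p_chance) / (1 - p_chance)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": ppv,
        "npv": npv,
        "f1": f1,
        "kappa": kappa,
    }


# Hypothetical counts for illustration only (not from the study)
metrics = classification_metrics(tp=40, fp=10, tn=140, fn=20)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Because Kappa corrects for chance agreement, it can be modest even when accuracy looks high on an imbalanced sample.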

Figure 1

Multi-model comparison diagram. (A) AUC of multiple models in the training set. Each color represents a machine learning algorithm. (B) AUC of multiple models in the validation set.

Table 2 Multi-model classification—validation set results.

Best model

After comparing multiple models, we found that LightGBM performed best and used it for modeling. The AUC was 0.995 in the training set, 0.804 in the validation set, and 0.8367 in the test set (Fig. 2A–C). During cross-validation, the model reached a stable state once the sample size of the training set and the validation set reached 400 (Fig. 2D). Supplementary Tables 2–4 show the metrics for model evaluation on the training set, validation set, and test set, respectively.

Figure 2

AUC of the LightGBM model. (A) AUC of the LightGBM model in the training set. (B) AUC of the LightGBM model in the validation set. (C) AUC of the LightGBM model in the test set. (D) Change in the AUC of the LightGBM model with training sample size. The abscissa represents the number of samples, and the ordinate represents the AUC value.

Model interpretability

The SHAP diagram in Fig. 3A shows how each variable in the validation set contributes to the prediction of liver fibrosis. The redder a point, the larger the value of the variable; the bluer the point, the smaller the value. On the vertical axis, the larger the absolute value of a negative number, the greater the possibility of a negative prediction, and the larger the absolute value of a positive number, the greater the possibility of a positive prediction. For example, the larger the RDW-SD value, the greater the possibility of liver fibrosis, whereas patients with higher lymphocyte and neutrophil counts have a lower possibility of liver fibrosis. Figure 3B shows the importance ranking of each variable; RDW-SD, lymphocytes and neutrophils are the more important variables. Figure 3C and Fig. 3D use two force diagrams to show how the variables of two individual samples affect the results. In Fig. 3C, the patient was predicted to have liver fibrosis and actually had it. The longest red arrow is neutrophils (0.93), indicating that neutrophils made the largest positive contribution to the outcome for this patient; the second largest positive contribution came from red blood cells (3.69). No variable made a negative contribution to the outcome. In Fig. 3D, the patient was predicted not to have liver fibrosis and in fact did not have it. The three variables with the most positive impact were neutrophils (1.71), red blood cells (3.47), and age (77.0), and the two variables with the most negative impact on the outcome were RDW-SD (42.7) and MCV (98.3).

Figure 3

Interpretability of the model. (A) SHAP diagram. Each point represents a sample. The redder the point, the larger the value of the variable, and the bluer the point, the smaller the value of the variable. The larger the ordinate of the point, the more likely the outcome is to be positive. (B) Importance ranking of key variables. The abscissa is the absolute value of the SHAP value, and the ordinate is the key variable. (C) A sample with a positive outcome. Red indicates a positive contribution to a positive outcome, and blue indicates a negative contribution to a positive outcome. The length of the bar indicates the size of the contribution: the longer the bar, the greater the contribution to the outcome. (D) A sample with a negative outcome.

Discussion

After infecting the host, Schistosoma japonicum produces a large number of eggs and deposits them in tissues such as the liver. Without timely and effective intervention, changes such as egg granuloma and liver fibrosis may further develop into hepatocellular carcinoma7. Studies have shown that liver fibrosis is not a single irreversible progression and may have the potential to regress8. Early diagnosis and treatment of liver fibrosis is therefore of positive significance. At present, schistosomiasis has not attracted enough attention in major endemic countries, so clinical and basic research on schistosomiasis lags behind, and there are few basic data studies on schistosomiasis liver fibrosis9. This study predicts the risk of liver fibrosis by constructing a diagnostic model, which has important clinical significance for early and correct treatment and intervention.

This study uses a machine learning model to predict liver fibrosis in schistosomiasis japonica, helping clinicians to better understand the impact of key factors on liver fibrosis. It supports early identification of liver fibrosis and assessment of its severity, so that patients with early liver fibrosis can be detected in time and their prognosis improved. In this study, data from 1049 patients with schistosomiasis japonica were analyzed to establish a liver fibrosis prediction model using machine learning algorithms to help identify patients at high risk of liver fibrosis. The model established in this study discriminates well and exhibits satisfactory specificity and sensitivity.

After 10 key factors were screened out, 6 different machine learning algorithms were used for classification. Compared with the other models, the LightGBM algorithm had better performance and higher stability, and the AUC of the optimal model was 0.8367. In the evaluation of the importance of model variables, the top three indicators with a positive contribution to the outcome of liver fibrosis were neutrophils, red blood cells, and age, while the indicators with the largest negative contributions were RDW-SD and MCV. Except for the patient’s age, all of these indicators come from the routine blood test.

Overall, the key variables included in the model may play an important role in the early diagnosis of Schistosoma japonicum liver fibrosis. Previous reports point out that there is a close relationship between routine blood indicators and liver fibrosis10, and the results of this study also support this association. The neutrophil-to-lymphocyte ratio (NLR) is widely used to assess inflammatory diseases. Studies have found that in patients with nonalcoholic fatty liver disease (NAFLD), NLR was significantly correlated with liver fibrosis stage and nonalcoholic fatty liver disease activity score (NAS); in chronic hepatitis B (CHB) patients, NLR was negatively correlated with liver fibrosis stage11,12,13,14. Therefore, NLR may be associated with the stage of liver fibrosis. Kekilli et al. also demonstrated that the ratio of neutrophils to lymphocytes reflects the severity of advanced liver fibrosis15. RDW is a parameter reflecting the heterogeneity of red blood cell volume, which is often used to diagnose different types of anemia, and is closely related to the body’s inflammation and nutritional status. Elevated RDW often indicates shortened lifespan and increased destruction of red blood cells. Michalak et al. suggest that RDW and its derivatives may be related to the deterioration of liver function16. Studies have shown that RDW is closely related to liver fibrosis in diseases such as NAFLD and CHB17,18,19. RDW can be expressed as RDW-CV and RDW-SD. RDW-SD is measured as the width of the red blood cell volume distribution curve at 20% above baseline. Studies have shown20 that RDW-SD is closely related to significant liver fibrosis (F2–F4) in CHB and can be used as an effective predictor of significant liver fibrosis in CHB. Liu et al.21,22,23 also found that only RDW-SD differed significantly between stages of liver fibrosis in AIH (P = 0.046).
In univariate logistic regression analysis, RDW-SD was a risk factor for advanced liver fibrosis (F3–F4) in AIH. MCV is a parameter that reflects the volume of red blood cells, and changes in MCV suggest that the patient’s hemoglobin synthesis is impaired. Liu et al.21 further found that MCV differed significantly among stages of liver fibrosis in AIH and was positively correlated with the severity of liver fibrosis. The combination of MCV and RDW can comprehensively reflect the dispersion of peripheral red blood cell volume. So far, the mechanism linking RDW, MCV and liver fibrosis is unclear, and may include the following points: (1) Inflammatory cytokines may inhibit the maturation of red blood cells and accelerate the entry of newer and larger reticulocytes into the peripheral circulation, resulting in increased RDW; (2) Patients with liver disease often have decreased intestinal absorption function, leading to deficiencies of folic acid, vitamin B12 and other nutrients, and in turn to varying degrees of megaloblastic anemia and heterogeneous changes in red blood cell volume; (3) Hepatic fibrosis often causes splenomegaly and hypersplenism, which accelerates red blood cell destruction and shortens the lifespan of red blood cells, which may promote the release of immature red blood cells and eventually lead to increased RDW17,24,25. These studies provide a theoretical basis for the correlation between routine blood indicators and liver fibrosis, but they do not clearly quantify the magnitude of the correlation or the degree of liver function deterioration, nor do they offer a means of predicting early liver fibrosis. Machine learning can make up for this deficiency. This study also found that age is a key variable associated with liver fibrosis in schistosomiasis japonica, and the model predicts that the older the patient, the greater the possibility of liver fibrosis.
The significance of the machine learning method for this study lies in the establishment of a clinical prediction and identification model through simple blood routine indicators and patient age to give suggestions for the diagnosis of complex liver fibrosis.

This study built and evaluated a machine learning model by taking advantage of abundant data. Compared with the models described in the published literature, the model in this study needs only routine blood results, age and gender to make a prediction, providing clinicians with a diagnostic method that is easier to operate and understand.

However, this study also has certain limitations. It is a single-center retrospective study, and some of the results discussed relate to individual patients, so inherent selection bias and information bias may not be avoidable. The next step is to conduct multi-center prospective research for external validation to further improve and promote this machine learning model. The variables of the current model only include the patient’s clinical information and test results. To optimize the performance of the identification model, biomarkers from microbiome and metabolomics studies could also be included. However, using only clinical variables at present reduces the burden on patients to a certain extent and offers a degree of convenience in clinical application. Finally, the limited interpretability of SHAP values warrants the development of more understandable models. In the future, we will further develop an automatic clinical scoring system based on nomograms or machine learning from the research data, in order to provide clinicians with more practical and easy-to-understand tools.

Methods

Study population

The study population consisted of patients diagnosed with Schistosoma japonicum infection in Yueyang, Hunan Province, China. This city has historically been a high schistosomiasis epidemic area because it is located near Dongting Lake in the middle and lower reaches of the Yangtze River, where the intermediate host Oncomelania hupensis breeds in large numbers.

Schistosoma japonicum infection was diagnosed according to the definition of Zhou et al.26, which includes the following diagnostic criteria: history of living in schistosomiasis-endemic areas, contact with infected water, specific Schistosoma serology testing, color ultrasound, and microscopic examination of excreta (feces, urine). Schistosomiasis infection was considered when schistosome ova were visualized in stool or urine, or when the Schistosoma serology was positive.

Liver fibrosis was determined by ultrasound according to the World Health Organization diagnostic criteria for Schistosoma japonicum infection27,28. An experienced ultrasound expert divided the patients into two groups according to the ultrasound results: fibrosis group (with mesh-like changes and uneven hepatic echotexture); no-fibrosis group (without mesh-like changes, smooth and uniform hepatic echotexture). The diagnosis was double-checked by another experienced schistosomiasis specialist.

Data collection

A retrospective medical record review was conducted from June 2019 to June 2022 at Xiangyue Hospital, Yueyang City, Hunan Province, China. All patients underwent blood tests and ultrasound evaluation at admission. All variables were extracted from the hospital’s electronic medical record system. The data included patient demographic characteristics, routine blood indicators and other variables. Missing data were filled in using the KNN filling method, which identifies the k samples closest to an incomplete sample in the data set by a distance measure and then uses these k samples to estimate the missing value. The percentage of missing data points is presented in Supplementary Table 5. The LassoCV method was used to screen out key variables. Data entry was performed by a full-time research physician or medical student. This study was reviewed and approved by the Ethics Committee of the Third Xiangya Hospital of Central South University (No: 21149) and was carried out in accordance with the Code of Ethics of the World Medical Association (Declaration of Helsinki). All methods were performed in accordance with the relevant guidelines and regulations. The need for informed consent was waived by the Ethics Committee of the Third Xiangya Hospital of Central South University owing to the retrospective nature of the study. The privacy of all participants is fully protected.
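
The KNN filling step described above can be illustrated with scikit-learn’s KNNImputer. This is a minimal sketch under stated assumptions: the toy matrix is invented, and the choice of k = 2 is illustrative because the value of k used in the study is not reported here.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Hypothetical toy matrix: rows are patients, columns are blood indicators.
# The second patient is missing the third indicator.
X = np.array([
    [1.0, 2.0, 3.0],
    [1.1, 2.1, np.nan],
    [0.9, 1.9, 2.8],
    [8.0, 9.0, 10.0],
])

# n_neighbors=2 is an assumption for illustration; the study does not state k.
imputer = KNNImputer(n_neighbors=2)
X_filled = imputer.fit_transform(X)

print(X_filled)
```

In this toy case the two nearest neighbours of the incomplete row are the first and third rows, so the missing entry is filled with the mean of 3.0 and 2.8. The distant fourth row does not influence the estimate.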

Feature selection

Patients were divided into hepatic fibrosis and non-hepatic fibrosis groups according to their color Doppler ultrasound results. Patients with hepatitis B virus (hepatitis B surface antigen seropositive), hepatitis C virus (HCV antibody seropositive), human immunodeficiency virus (HIV antibody seropositive), alcoholic and non-alcoholic fatty liver disease (ultrasound scanning and alcohol consumption above 30 g daily), decompensated liver disease or liver cancer (ultrasound and liver function tests), and organ transplantation (self-reported) were excluded. The key variables were selected by the LassoCV method for subsequent modeling.

Study design

First, the classification task was completed using 6 machine learning algorithms: ‘XGB Classifier’, ‘Logistic Regression’, ‘LightGBM Classifier’, ‘Random Forest Classifier’, ‘Support Vector Classification’, and ‘K Neighbors Classifier’. Fivefold cross-validation was used for validation. Each model was evaluated using AUC, clinical decision curve plot, accuracy, sensitivity, specificity, positive predictive value, negative predictive value, and F1 score. The ROC diagram and the forest diagram show the ROC results of each model for the prediction of “hepatic fibrosis”.
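
The multi-algorithm comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the study’s code: the data are synthetic, all hyperparameters are library defaults, and scikit-learn’s GradientBoostingClassifier stands in for the XGBoost and LightGBM classifiers so that the example depends on scikit-learn only.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in data: 10 features, roughly 27% positives (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           weights=[0.73, 0.27], random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    # Stand-in for the boosted-tree libraries used in the study
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Support Vector Classification": SVC(random_state=0),
    "K Neighbors": KNeighborsClassifier(),
}

# Fivefold stratified cross-validation, scored by AUC
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
results = {}
for name, model in models.items():
    aucs = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    results[name] = (aucs.mean(), aucs.std())

best_name = max(results, key=lambda name: results[name][0])
for name, (mean_auc, sd_auc) in results.items():
    print(f"{name}: AUC = {mean_auc:.3f} (SD {sd_auc:.3f})")
print("Best by mean validation AUC:", best_name)
```

The per-model mean and SD of the fold AUCs are the kind of quantities a forest plot of model performance would display.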

After the best algorithm was selected through multi-algorithm model comparison, it was used to build the model again. Unlike in the multi-model comparison, when modeling with the best-performing algorithm we randomly selected 15% of the total samples as the test set, and the remaining samples were used as the training set for fivefold cross-validation.
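
The split described above, a 15% hold-out test set with fivefold cross-validation on the remainder, can be sketched as below. The data are synthetic, GradientBoostingClassifier is a stand-in for LightGBM, and refitting on all development samples before testing is an assumption rather than a documented step of the study.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# Synthetic stand-in data (assumption): 400 samples, 10 features
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           weights=[0.73, 0.27], random_state=0)

# Hold out 15% of all samples as the test set
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)

# Fivefold cross-validation on the remaining 85%
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
val_aucs = []
for train_idx, val_idx in cv.split(X_dev, y_dev):
    model = GradientBoostingClassifier(random_state=0)
    model.fit(X_dev[train_idx], y_dev[train_idx])
    val_scores = model.predict_proba(X_dev[val_idx])[:, 1]
    val_aucs.append(roc_auc_score(y_dev[val_idx], val_scores))

# Assumed final step: refit on all development data, then score the test set once
final_model = GradientBoostingClassifier(random_state=0).fit(X_dev, y_dev)
test_auc = roc_auc_score(y_test, final_model.predict_proba(X_test)[:, 1])

print("Validation AUC per fold:", [round(a, 3) for a in val_aucs])
print("Test AUC:", round(test_auc, 3))
```

Keeping the test set out of every cross-validation fold means the test AUC is not influenced by model selection.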

Model interpretation

The SHAP package in Python can interpret the output of machine learning models by treating all features as “contributors”. For each sample, the model generates a predicted value, and the main advantage of SHAP is that it reflects the influence of each feature in each sample and shows whether that influence is positive or negative. This study used the SHAP package to interpret the model. SHAP value plots were used to show the contribution of each variable in the model. Model variable importance plots were used to show the importance ranking of each variable. Force diagrams were used to illustrate, with two examples, how each variable affects the predicted outcome for an individual sample.
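
To make the idea of per-sample feature contributions concrete, the sketch below computes exact Shapley values by brute-force enumeration for a toy model. It does not use the SHAP package, which relies on more efficient algorithms; the model, its coefficients, the sample, and the baseline values are all hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values for one sample by enumerating feature subsets.

    Features outside a subset are replaced by their baseline value.
    """
    n = len(x)
    values = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            # Weight of every subset of this size in the Shapley average
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                with_i = [x[j] if (j in subset or j == i) else baseline[j]
                          for j in range(n)]
                without_i = [x[j] if j in subset else baseline[j]
                             for j in range(n)]
                values[i] += weight * (predict(with_i) - predict(without_i))
    return values


def toy_model(features):
    """Hypothetical linear risk score; coefficients are invented."""
    rdw_sd, lymphocytes, age = features
    return 0.05 * rdw_sd - 0.4 * lymphocytes + 0.01 * age


sample = [50.0, 1.0, 70.0]    # hypothetical patient
baseline = [42.0, 2.0, 60.0]  # hypothetical reference values
phi = shapley_values(toy_model, sample, baseline)

for name, value in zip(["RDW-SD", "Lymphocytes", "Age"], phi):
    print(f"{name}: {value:+.3f}")
```

The contributions sum to the difference between the prediction for the sample and the prediction for the baseline, which is the additive property that a force diagram visualizes.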

Statistical method

Python version 3.7 was used in this study. The statsmodels 0.11.1 package in Python was used to test whether each variable differed between the two groups. The analysis method was selected according to the distribution of samples, homogeneity of variance, and sample size. The chi-square test was used for categorical variables, and Student’s t-test or the Mann–Whitney U-test was used for quantitative variables.
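
The group comparisons described above can be sketched as follows. Note the substitution: this example uses scipy.stats rather than the statsmodels package named in the study, the data are simulated, and the rule for choosing between tests (a Shapiro-Wilk normality check only) is a simplification of the criteria listed above.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated quantitative variable in two groups (hypothetical values)
age_fibrosis = rng.normal(66, 10, size=80)
age_no_fibrosis = rng.normal(60, 10, size=200)


def compare_groups(a, b, alpha=0.05):
    """Use the t-test if both groups look normal, else the Mann-Whitney U-test."""
    looks_normal = stats.shapiro(a)[1] > alpha and stats.shapiro(b)[1] > alpha
    if looks_normal:
        return "Student's t-test", stats.ttest_ind(a, b)[1]
    return "Mann-Whitney U-test", stats.mannwhitneyu(a, b, alternative="two-sided")[1]


test_name, p_value = compare_groups(age_fibrosis, age_no_fibrosis)
print(f"{test_name}: p = {p_value:.4f}")

# Simulated 2 x 2 table for a categorical variable (rows: groups, columns: categories)
table = np.array([[30, 50],
                  [90, 110]])
chi2, p_cat, dof, expected = stats.chi2_contingency(table)
print(f"Chi-square test: chi2 = {chi2:.3f}, dof = {dof}, p = {p_cat:.4f}")
```

A fuller version would also check homogeneity of variance before choosing the t-test, as the selection criteria above imply.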

In this study, LassoCV was used to screen key variables, and factors with a coefficient of 0 were automatically eliminated (sklearn 0.22.1 package in Python). Lasso obtains a sparser model by adding a penalty function that shrinks the regression coefficients: the sum of the absolute values of the coefficients is constrained to be less than a fixed value, which sets some coefficients exactly to zero. It thus retains the advantage of subset selection and provides a biased estimator suited to data with multicollinearity. In the multi-model and best-model modeling processes, the xgboost 1.2.1 package of Python was used for XGBoost algorithm modeling, the lightgbm 3.2.1 package of Python was used for LightGBM algorithm modeling, and the sklearn 0.22.1 package of Python was used to build the other models. The shap 0.39.0 package in Python was used to demonstrate the interpretability of the model.
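
The variable screening step can be illustrated with a short sketch that keeps only the features whose Lasso coefficient is non-zero. The data and feature names are synthetic, and both standardizing the features and fitting LassoCV directly to the binary label are assumptions about how the step might be carried out, not confirmed details of the study.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 20 candidate variables, 5 of them informative
X, y = make_classification(n_samples=300, n_features=20, n_informative=5,
                           n_redundant=0, random_state=0)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]  # hypothetical names

# Standardize so that the L1 penalty treats all variables on the same scale
X_scaled = StandardScaler().fit_transform(X)

# LassoCV chooses the penalty strength by cross-validation (fivefold here)
lasso = LassoCV(cv=5, random_state=0).fit(X_scaled, y.astype(float))

# Variables with a coefficient of exactly 0 are eliminated
selected = [name for name, coef in zip(feature_names, lasso.coef_) if coef != 0]

print("Chosen alpha:", lasso.alpha_)
print("Selected variables:", selected)
```

A larger penalty drives more coefficients to zero, so the cross-validated penalty determines how many variables survive screening.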

Ethical standards

Ethics approval was obtained from the Ethics Committee of the Third Xiangya Hospital of Central South University.