A machine learning model for distinguishing Kawasaki disease from sepsis

Kawasaki disease (KD) is an acute systemic vasculitis that most commonly affects children under 5 years old. Sepsis is a systemic inflammatory response syndrome caused by infection. Both present mainly with fever, and laboratory findings in both include elevated white blood cell (WBC) count, C-reactive protein, and procalcitonin. However, the treatments for the two conditions are very different. It is therefore necessary to establish a dynamic nomogram based on clinical data to help clinicians make timely diagnoses and decisions. In this study, we analyzed 299 KD patients and 309 sepsis patients. We collected patients' age, sex, height, weight, BMI, and 33 biological parameters from routine blood tests. After dividing the patients into a training set and a validation set, the least absolute shrinkage and selection operator method, support vector machine, and receiver operating characteristic curve were used to select significant factors and construct the nomogram. The performance of the nomogram was evaluated by discrimination and calibration. Decision curve analysis was used to assess the clinical usefulness of the nomogram. The nomogram shows that height, WBC, monocyte, eosinophil, lymphocyte to monocyte count ratio (LMR), prealbumin (PA), γ-glutamyltransferase (GGT), and platelet are independent predictors in the KD diagnostic model. The c-index of the nomogram was 0.926 in the training set and 0.878 in the validation set, indicating good discrimination. The nomogram is well calibrated. Decision curve analysis showed that the nomogram has good clinical application value and can assist decision-making. The nomogram performs well in distinguishing KD from sepsis and can help pediatricians make early clinical decisions.

Kawasaki disease (KD), also known as mucocutaneous lymph node syndrome, is most common in children under 5 years old. It is an acute systemic vasculitis that mainly damages small and medium-sized blood vessels. Tomisaku Kawasaki first reported it in 1967 1,2. The prevalence of KD varies widely across ethnic groups and currently ranges from 4 to 25 per 100,000 children under 5 years old in North America, Australia, and Europe [3][4][5].
The incidence in Japan, Korea, and Taiwan is 10-20 times higher than in the United States and Europe [6][7][8].
Additional studies have shown that the incidence of KD continues to rise in Asia [9][10][11]. The main clinical features of KD are fever, bilateral nonexudative conjunctivitis, oral changes, erythematous rash (appears in the acute phase and affects 80-90% of patients with KD), extremity changes (appear in the acute and subacute phases and affect 80-90% of patients with KD), and cervical lymphadenopathy (appears in the acute phase and affects 50-60% of patients with KD) 12,13. Studies have shown that, apart from fever, conjunctivitis is the most frequent feature in children with KD, while cervical lymphadenopathy is the least frequent 14,15. Children who do not present with enough of the principal clinical features are diagnosed with incomplete KD 12. Apart from fever, each major clinical manifestation is less frequent in incomplete KD than in complete KD, with the difference most pronounced for cervical lymphadenopathy 14,15. The prevalence of incomplete KD varies from region to region, ranging from 16.1 to 48.4% [16][17][18][19]. Yahui et al. found that the incidence of both complete and incomplete KD increased over time, but that the incidence of incomplete KD increased more rapidly than that of complete KD (0.35 cases per 100,000 per year for incomplete KD compared to complete KD) 20. The main complication of KD is coronary artery lesions (CAL), but other coronary complications may also occur. The incidence of coronary artery aneurysms is about 20-30% in untreated cases 21,22. Children younger than 6 months tend to have incomplete KD, which often delays their diagnosis and treatment 23. These patients tend to be at high risk for CAL and intravenous immunoglobulin (IVIG) resistance 12,[23][24][25]. In adults with coronary artery disease, incomplete KD is usually diagnosed only retrospectively 26,27. However, the etiology of KD is still unknown.
Based on multiple studies, the consensus among clinical researchers is that KD is an immune-mediated disease triggered by infection in patients with a genetic predisposition [28][29][30][31]. Currently, in the absence of an etiological test, KD is diagnosed mainly on the basis of clinical manifestations and the exclusion of other known diseases with similar clinical manifestations. A systematic review by the Cochrane Collaboration showed that a single dose of 2 g/kg IVIG administered before day 10 after onset reduces the development of CAL 32. Therefore, according to current guidelines, high-dose IVIG remains the first-line treatment for KD 12. However, delayed or missed diagnosis of KD in clinical practice may also lead to over-medication and even invasive treatment of children with KD 33. Therefore, early and correct diagnosis and treatment of KD are crucial for prognosis. Sepsis is a systemic inflammatory response syndrome caused by infection and a significant cause of septic shock and multiple organ dysfunction syndrome. In high-income countries, more than 4% of hospitalized children and 8% of children in pediatric intensive care units (PICU) suffer from sepsis [34][35][36][37][38]. Mortality in children with sepsis ranges from 4 to 50%, depending on disease severity, risk factors, and geographic location 34,35,[39][40][41]. Therefore, early diagnosis and appropriate treatment are essential to optimize outcomes in children with sepsis.

Results
Characteristics of the overall patients. Our study recruited 309 sepsis patients and 299 KD patients, including 22 patients with incomplete KD and 2 with Kawasaki disease shock syndrome (KDSS). The training set (N = 426) comprised 221 sepsis and 205 KD patients, and the validation set (N = 182) comprised 88 sepsis and 94 KD patients. The baseline characteristics of the KD and sepsis groups are shown in Table S1 of the Supplemental Material. The comparison of KD and sepsis patients within the training and validation sets is shown in Table 1.
In both the training and validation sets, KD and sepsis patients differed significantly (p < 0.05) in weight, WBC, neutrophil, lymphocyte count, monocyte, eosinophil, LMR, platelet to lymphocyte count ratio (PLR), albumin (ALB), albumin/globulin ratio (AGR), prealbumin (PA), hematocrit (HCT), platelet (PLT), red blood cell count (RBC), alanine aminotransferase (ALT), γ-glutamyltransferase (GGT), sodium (Na), blood urea nitrogen (BUN), and calcium (Ca). In addition, the differences in age, height, red blood cell distribution width (RDW), and total protein (TP) were statistically significant in the training set, so these variables were also included in variable selection.
Characteristics of training and validation sets. 70% of patients were randomly assigned to the training set and the remaining 30% to the validation set. No variable differed significantly between the training set and the validation set (p > 0.05), indicating that the two sets are comparable and supporting the stability of the model (Table 2).
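The random 70/30 assignment described above can be sketched as follows. This is a minimal illustration in Python, not the authors' code (the analysis was carried out in R). The function name, the fixed seed, and the use of integer patient identifiers are assumptions made for the example.

```python
import random


def split_train_validation(patient_ids, train_fraction=0.7, seed=0):
    """Randomly assign patients to a training set and a validation set."""
    ids = list(patient_ids)
    rng = random.Random(seed)  # fixed seed is an assumption, used for reproducibility
    rng.shuffle(ids)
    n_train = round(train_fraction * len(ids))
    return ids[:n_train], ids[n_train:]


# 608 patients in total (309 sepsis + 299 KD), identified here by index only.
train_ids, validation_ids = split_train_validation(range(608))
print(len(train_ids), len(validation_ids))  # 426 182
```

With 608 patients, a 70% share rounds to 426, which matches the set sizes reported above. The split is not stratified by diagnosis, so the group proportions in each set depend on the random draw.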

Risk factor selection for construction of the nomogram. Initially, 24 variables showed statistically significant differences, and the training set data were then screened for predictors. The 24 variables were reduced to 15 potential predictors using the least absolute shrinkage and selection operator (LASSO) (Fig. 1a,b) and to 21 using support vector machine (SVM) (Fig. 1c). Taking the intersection of the SVM and LASSO selections (Venn diagram, Fig. 1d) left 13 variables for building the model. We converted these 13 continuous variables into categorical variables based on the cut-off value at the maximum area under the curve (AUC) (Fig. 2a and Table 3) and built a multiple logistic regression model. The forest plot (Fig. 2b) and Table 4 show that the final logistic regression model contains 8 independent predictors. Collinearity between variables is shown in Fig. 2c. Based on the logistic regression results, a nomogram (Fig. 3a) was created and made available online at https://hanchenchen.shinyapps.io/KD-nom/ (Fig. 3b).
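The two screening steps above (intersecting the LASSO and SVM selections, then dichotomizing each continuous variable at the cut-off with the largest AUC) can be sketched in Python. This is an illustrative reading, not the authors' R code. The variable sets and the toy height data are invented for the example, and the cut-off search assumes that "maximum AUC" refers to the AUC of the dichotomized variable, which equals (sensitivity + specificity) / 2.

```python
# Hypothetical selections, for illustration only (not the sets from Fig. 1).
lasso_selected = {"height", "WBC", "monocyte", "PA", "GGT"}
svm_selected = {"height", "WBC", "monocyte", "PA", "ALT"}

# Keep only the variables chosen by both methods (the Venn-diagram intersection).
shared = sorted(lasso_selected & svm_selected)
print(shared)


def dichotomized_auc(values, labels, cutoff):
    """AUC of the binary marker (value >= cutoff): (sensitivity + specificity) / 2."""
    pos = [v for v, y in zip(values, labels) if y == 1]
    neg = [v for v, y in zip(values, labels) if y == 0]
    sensitivity = sum(v >= cutoff for v in pos) / len(pos)
    specificity = sum(v < cutoff for v in neg) / len(neg)
    auc = (sensitivity + specificity) / 2
    # A marker that is lower in cases is equally informative with the rule reversed.
    return max(auc, 1 - auc)


def best_cutoff(values, labels):
    """Return the observed value that maximizes the AUC of the dichotomized marker."""
    candidates = sorted(set(values))
    return max(candidates, key=lambda c: dichotomized_auc(values, labels, c))


# Toy data (invented): heights in cm, label 1 = KD, 0 = sepsis.
heights = [60, 65, 70, 76, 80, 90]
is_kd = [0, 0, 0, 1, 1, 1]
print(best_cutoff(heights, is_kd))  # 76
```

Maximizing this quantity is equivalent to maximizing the Youden index, so the sketch would give the same cut-offs as that more common criterion.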
Evaluations of the predictive model. We evaluated the performance of the nomogram using calibration and discrimination. We internally validated and calibrated the nomogram using 1000 bootstrap resamples. The calibration curves (Fig. 4a,b) showed no significant deviation from the ideal fit: predicted probabilities agreed with observed outcomes, indicating that the nomogram is well calibrated. The mean absolute errors in the training and validation sets were 0.006 and 0.057, respectively. The receiver operating characteristic (ROC) curves (Fig. 4c,d) were used to evaluate the diagnostic value of the selected indicators and showed good discrimination. The areas under the ROC curves for the training and validation sets were 0.926 and 0.878, respectively. Decision curve analysis (DCA) of the nomogram (Fig. 4e,f) indicated that the model has good clinical application value and can assist decision-making.
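For readers unfamiliar with the two summary measures used here, the following Python sketch shows how a c-index (equal to the area under the ROC curve for a binary outcome) and the net benefit plotted in a decision curve are usually computed. It is not the authors' code. The function names and the toy probabilities are assumptions, and the net benefit follows the standard definition TP/n - FP/n × pt/(1 - pt) at threshold probability pt.

```python
def c_index(probabilities, labels):
    """Fraction of (case, control) pairs in which the case has the higher
    predicted probability; ties count as one half."""
    cases = [p for p, y in zip(probabilities, labels) if y == 1]
    controls = [p for p, y in zip(probabilities, labels) if y == 0]
    concordant = 0.0
    for pc in cases:
        for pn in controls:
            if pc > pn:
                concordant += 1.0
            elif pc == pn:
                concordant += 0.5
    return concordant / (len(cases) * len(controls))


def net_benefit(probabilities, labels, threshold):
    """Net benefit of treating patients whose predicted probability >= threshold."""
    n = len(labels)
    true_pos = sum(1 for p, y in zip(probabilities, labels) if p >= threshold and y == 1)
    false_pos = sum(1 for p, y in zip(probabilities, labels) if p >= threshold and y == 0)
    return true_pos / n - (false_pos / n) * threshold / (1 - threshold)


# Toy predictions (invented): label 1 = KD, 0 = sepsis.
probs = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
truth = [1, 1, 1, 0, 0, 0]
print(round(c_index(probs, truth), 3))           # 0.889
print(round(net_benefit(probs, truth, 0.5), 3))  # 0.167
```

A decision curve repeats the net-benefit calculation over a range of thresholds and compares the model with the "treat all" and "treat none" strategies.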

Discussion
Currently, clinicians mainly diagnose KD by clinical manifestations and echocardiography, and no specific laboratory test is available for diagnosis. On the one hand, KD patients may lack sufficiently typical clinical features in the early stage, or parents may fail to describe typical symptoms to doctors. On the other hand, sepsis and KD patients have similar early clinical manifestations and laboratory findings. In addition, incomplete KD and KDSS are challenging to distinguish from sepsis, which makes it harder for clinicians to reach a precise diagnosis. However, timely diagnosis and treatment of KD can prevent serious sequelae such as CALs. Therefore, we developed a predictive model for diagnosing KD by comparing the laboratory results and epidemiological data of children with KD and sepsis. The model demonstrates good discrimination, predictive power, and clinical utility and can help pediatricians diagnose KD early using data from the child's first visit. In our study, height was the most important predictor. However, there was no statistically significant difference in BMI between KD and sepsis. According to our results, a child whose height is greater than or equal to 74.5 cm is more likely to have KD. Whether this means that children aged about 1 year or older are more likely to have KD is worthy of study. Only limited literature is available on KD below the age of 1 year, and little information is available from developed countries on this subject 51–53. These reports suggest that the prevalence of KD in children under one year of age is lower than in children over one year of age, which in a sense is consistent with our finding that children with a height greater than or equal to 74.5 cm (close to the height of a one-year-old child) are more likely to have KD.
KD and sepsis patients both have leukocytosis in the acute phase, and our study shows that WBC counts are less elevated in KD patients than in sepsis patients and can act as an independent predictor. Several studies have reported leukocytes as a biomarker for the diagnosis of KD. In a case-control study by Liu S. et al., WBC counts ≥ 11.12 × 10^9/L played a predictive role in differentiating KD from non-KD febrile infectious diseases 54. In addition, in another article by the same group, WBC < 19.7 × 10^9/L played an essential role in helping to differentiate KD from sepsis, which is broadly consistent with our finding that WBC is lower than 16.1 × 10^9/L in KD patients 50. In a retrospective study, patients with sepsis had higher WBC counts (17.94 ± 10.04 × 10^9/L) than those with viral infection (10.42 ± 4.21 × 10^9/L) (p < 0.001) 55. Together, these findings may suggest that the immune mechanism of KD is intermediate between those of sepsis and viral infectious diseases. In addition, Tian Xie et al. reported that patients with abnormally elevated leukocytes were more likely to develop IVIG resistance in both complete and incomplete KD, and that patients with CAL had significantly higher WBC than those without CAL 56.
In a scoring system developed in Taiwan, the percentage of monocytes can be used as an indicator to differentiate KD from other febrile illnesses in children 57. Vasculitis in KD is characterized by granulomatous inflammation, and monocytes are involved in its formation. Infiltration of monocytes, abnormal activation of macrophages, and production of inflammatory factors and chemokines are involved in forming vascular lesions 58. Rowley et al. observed that monocytes/macrophages infiltrated arteritis lesions in autopsy cases of KD 59. Enhanced expression of toll-like receptor (TLR) 2 on monocytes was found in patients with KD and in a mouse model of coronary vasculitis 60. These findings suggest a pro-inflammatory role for monocytes in KD.
Classical monocytes (CM), intermediate monocytes (IM), and non-classical monocytes (NCM) are the three major subpopulations of human monocytes and play pro-inflammatory, antigen presentation, and antiviral roles, respectively 61,62. CM with high SELL expression were significantly elevated in KD patients 63. In contrast, the IM and NCM subpopulations were significantly elevated in sepsis patients 64. These findings suggest that monocytes might play different roles in the pathogenic mechanisms of KD and sepsis.
Eosinophils are immune cells involved in allergic reactions and the response to parasite infections 65. In our study, the eosinophil count was higher in KD than in sepsis and was a significant independent predictor in the diagnostic model. Chih-Min Tsai et al. showed that the percentage of eosinophils (> 1.5%) was the most important independent predictor in a scoring system to differentiate KD from febrile infection, as did Xiaoping Liu et al. 54,57. In the last decade, eosinophils have been identified as potential sepsis biomarkers. Abidi et al. and Shaaban et al. successively demonstrated the feasibility and sensitivity of eosinophils in diagnosing sepsis 66,67. Eosinophil-associated T-helper 2 (Th2) mediators (IL-4, IL-5 and eotaxin) were increased after IVIG treatment, and Th2 is known to play an anti-inflammatory role 68. In addition, it has been shown that KD patients with an increase in eosinophils have a decreased likelihood of IVIG resistance and CAL formation 68,69. These findings suggest that eosinophils may play an anti-inflammatory role in KD. Patients with KD had higher eosinophil counts before and after IVIG treatment than those with enterovirus infection 70. The differences in eosinophil counts between KD and sepsis may indicate that the mechanisms of the inflammatory response differ between the two diseases. Studies have also shown that patients with complete KD had significantly lower peripheral blood eosinophil counts than those with incomplete KD, which may help diagnose incomplete KD 71.
Our study found that PA, rather than ALB, was a predictor. The study by Liu et al. collected data only on ALB, so the inclusion of PA is one of the innovations of our study 50. Like us, Huang et al. found that PA was more valuable than ALB for diagnosing KD, although both ALB and PA levels are reduced in KD patients 72,73. PA has a shorter half-life than ALB and is more stable and sensitive in measuring liver function and malnutrition 74,75. Research has shown that serum PA levels are associated with the prognosis of various diseases [76][77][78]. Li Zhang et al. found that PA has the following characteristics: (1) a reference value for healthy individuals can be established; (2) it changes significantly in KD patients; (3) it returns to an almost normal level after successful IVIG treatment 79. Therefore, it can be used as a marker for the diagnosis and treatment monitoring of KD and for responsiveness to IVIG treatment. Lower AGR and hypoalbuminemia have been identified as independent predictors of CAL 73,[80][81][82]. More specifically, the 22nd Japanese KD Epidemiological Survey revealed that a 1 g/dL reduction in ALB implied a 0.66-fold elevated risk of coronary artery dilation and a 0.34-fold increased risk of coronary aneurysm 83. However, these investigators did not collect PA-related information, and future work may find a more significant role for PA relative to ALB in predicting the occurrence of CALs in patients with KD. Vasculitis due to KD can involve all medium-sized arteries and viscera, including the liver 84. Liver pathology in patients with KD can show inflammatory cell infiltrates, Kupffer cell augmentation and/or swelling, fatty degeneration, and stasis in the sinusoidal and portal vein regions 85,86.
In addition to CALs, hepatic insufficiency is a common manifestation during the acute phase of KD, as evidenced by elevated serum liver enzymes and bilirubin and reduced ALB 87,88. According to a retrospective study by Goshgar Mammadov and colleagues, 90.95% of children with KD presented with at least one abnormal liver function indicator 80. Natural killer cells are activated by cytokines, accumulate in inflammatory lesions, and converge on the vascular endothelium and hepatic sinusoids, resulting in hepatocellular injury and endothelial damage. These may be the causes of abnormal liver function in patients with KD 89. Tremoulet et al. found that 62.7% of KD patients had increased GGT values and 40.3% had elevated ALT values 90. A predictive model that differentiated KD from febrile illness in Taiwan found that ALT was more specific than aspartate aminotransferase (AST). However, GGT was not included in that study 57. Nomograms established in the United States and Taiwan show that PLT and ALT are biomarkers for distinguishing KD from febrile disease 57,91. In our study, according to statistical analysis, GGT was more specific than ALT in diagnosing KD.
An abnormal increase in PLT count is a feature of KD. A retrospective study showed that leukocytes, PLT, CRP, PCT, and other inflammatory mediators were remarkably increased in serum during the acute stage of KD 57. Activation of PLTs is the first step when blood vessels are damaged and the endothelium ruptures. PLTs are also inflammatory effector cells involved in a range of events from acute inflammation to adaptive immunity 92. Many receptors on the surface of PLTs frequently interact with WBCs and endothelial cells. In vitro studies have shown that neutrophils partially depend on PLTs to potentiate fibrin deposits in the blood 93. These findings all indicate a correlation between PLTs and vascular inflammation. Unlike Liu's study, our study found that PLT was an independent predictor for distinguishing KD from sepsis 50. According to a study in Sichuan, China, other parameters of PLTs, like mean platelet volume and platelet distribution width during fever, can help   94. More studies are needed to verify whether PLT parameters can be helpful in the diagnosis of KD. Furthermore, PLTs and their other parameters may also help identify patients with IVIG resistance. In the study of Gang Li et al., thrombocytopenia (< 300 × 10^9/L) was significantly associated with IVIG resistance in KD patients 95. Liu et al. suggested that the PLT reduction in KD patients with IVIG resistance may be related to the persistent depletion of PLTs due to coronary artery disease 96. Recently, the peripheral biomarkers of immunity/inflammation neutrophil to lymphocyte count ratio (NLR) and PLR were identified as significant prognostic factors in IVIG-resistant KD patients [97][98][99][100][101].
To our knowledge, and in light of previous studies on this topic, this is the first dynamic nomogram to aid clinicians in differentiating KD from sepsis. Because it uses a continuous scale to calculate the probability of a specific outcome, a nomogram can offer higher accuracy and better discrimination than other clinical prediction tools or scoring systems. Moreover, this study included basic patient data (height, weight, age, BMI), which are easy to obtain but often overlooked in clinical practice. In addition, we included biomarkers such as PA, GGT, LMR, and eosinophil count, which are rarely used to diagnose KD; PA and GGT appeared to be more specific than ALB and ALT in our study.
However, our study has limitations: (1) It is a single-center retrospective study and lacks external validation, so selection bias cannot be ignored. (2) We did not collect data on PCT, IL-6, and erythrocyte sedimentation rate, which increase during the acute phase of KD and sepsis but were not included in the biochemical data. (3) A limitation of LASSO regression is that, when two independent variables are highly correlated, it may drop one of them somewhat arbitrarily. We will address these limitations through further randomized controlled studies and additional external validation.
This is the first study to use a dynamic nomogram to develop a predictive model that uses height, WBC, monocyte, eosinophil, LMR, PA, GGT, and PLT to help clinicians distinguish KD from sepsis accurately and efficiently.

Data collection. We collected data on 38 variables from epidemiological data, routine blood data, and biochemical test data. Epidemiological data include age, sex, height, weight, and BMI. Routine blood data include WBC, neutrophil, lymphocyte, monocyte, eosinophil, RBC, hemoglobin (HB), HCT, mean corpuscular volume (MCV), mean corpuscular hemoglobin concentration (MCHC), RDW, and PLT; NLR, LMR, and PLR were also calculated from routine blood data. Biochemical test data include total bilirubin (TBIL), direct bilirubin (DBIL), indirect bilirubin (IBIL), TP, ALB, globulin (GLB), AGR, PA, ALT, AST, alkaline phosphatase (ALP), GGT, lactate dehydrogenase (LDH), BUN, Na, Ca, iron (Fe), and CRP. All data were collected at the first visit, before IVIG administration in patients with KD and before antibiotic treatment in patients with sepsis.
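The derived variables mentioned above follow directly from their definitions in the text: NLR is the neutrophil to lymphocyte count ratio, LMR the lymphocyte to monocyte count ratio, and PLR the platelet to lymphocyte count ratio. A minimal Python sketch is given below. The function names, the input units (all cell counts in 10^9/L), and the example values are assumptions for illustration, and BMI is computed with the usual weight (kg) divided by height (m) squared.

```python
def derived_ratios(neutrophil, lymphocyte, monocyte, platelet):
    """Compute NLR, LMR, and PLR from routine blood counts (same units throughout)."""
    return {
        "NLR": neutrophil / lymphocyte,  # neutrophil to lymphocyte count ratio
        "LMR": lymphocyte / monocyte,    # lymphocyte to monocyte count ratio
        "PLR": platelet / lymphocyte,    # platelet to lymphocyte count ratio
    }


def bmi(weight_kg, height_cm):
    """Body mass index in kg/m^2."""
    height_m = height_cm / 100
    return weight_kg / height_m ** 2


# Invented example values, not patient data.
ratios = derived_ratios(neutrophil=8.0, lymphocyte=4.0, monocyte=1.0, platelet=400.0)
print(ratios)
print(round(bmi(10.0, 80.0), 2))
```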

Definitions of KD and sepsis. The diagnosis of KD was made according to the 2017 American Heart Association (AHA) criteria 12. The diagnosis of classic KD is based on a fever ≥ 5 days and fulfilling at least 4 of the 5 main clinical features. The five clinical features are: (1) erythema and cracking of the lips, strawberry tongue, and/or erythema of the oral and pharyngeal mucosa; (2) bilateral bulbar conjunctival congestion without exudate; (3) rash: maculopapular, diffuse erythrodermic, or erythema multiforme-like; (4) erythema and edema of the hands and feet in the acute phase and/or periungual desquamation in the subacute phase; (5) enlarged cervical lymph nodes (≥ 1.5 cm in diameter), usually unilateral. Incomplete KD is diagnosed by a fever ≥ 5 days with 2 or 3 compatible major clinical features and ≥ 3 additional laboratory findings or a positive echocardiogram. Additional laboratory findings include anemia for age, PLT ≥ 450,000/mm^3 after day 7 of fever, ALB ≤ 3.0 g/dL, increased ALT, WBC ≥ 15,000/mm^3, and WBC/HPF ≥ 10 on urinalysis. Sepsis was defined in accordance with the 2016 Surviving Sepsis Campaign Guidelines 102. Sepsis is diagnosed by signs and symptoms of inflammation and infection with hyperthermia or hypothermia (rectal temperature > 38.5 or < 35 °C), tachycardia (which may not be present in hypothermic patients), and signs of altered function in at least one of the following: altered mental status, hypoxemia, increased serum lactate levels, or bounding pulses.
Statistical analyses. Normality testing showed that none of the continuous variables were normally distributed, so we used the Wilcoxon rank sum test to analyze the quantitative variables. The Chi-square test and Fisher's exact test were applied to analyze the categorical variables. We randomly resampled the entire sample a thousand times to build a prediction model that can help differentiate KD from sepsis. First, we randomly selected 70% of the patients for the training set and used the remaining 30% as the validation set. Second, the variables selected by both LASSO and SVM (their intersection) were taken as the significant variables. The selected continuous variables were converted to categorical variables at the ROC cut-off where the AUC was maximal. Third, we built a prediction nomogram using a multiple logistic regression model to show each predictor's odds ratio and β coefficient. Finally, we used the Shiny platform to build a dynamic nomogram. Data analysis was performed with R software, version 4.1.2. P-values < 0.05 were considered statistically significant.
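A nomogram of this kind is a graphical form of the fitted logistic regression model, in which the predicted probability is 1 / (1 + exp(-(β0 + Σ βi·xi))) and each odds ratio equals exp(βi). The Python sketch below illustrates that calculation for dichotomized predictors. The intercept and coefficients are hypothetical placeholders chosen for the example, not the fitted values reported in Table 4, and the sketch must not be used for clinical prediction.

```python
import math

# Hypothetical values for illustration only; NOT the coefficients from Table 4.
INTERCEPT = -1.0
BETAS = {
    "height_ge_74.5cm": 1.2,  # indicator: height >= 74.5 cm
    "wbc_lt_16.1": 0.8,       # indicator: WBC < 16.1 x 10^9/L
}


def predicted_probability(indicators):
    """Logistic model: probability of KD given 0/1 indicator values."""
    logit = INTERCEPT + sum(BETAS[name] * indicators[name] for name in BETAS)
    return 1.0 / (1.0 + math.exp(-logit))


def odds_ratio(beta):
    """Odds ratio associated with a one-unit change in a predictor."""
    return math.exp(beta)


p_both = predicted_probability({"height_ge_74.5cm": 1, "wbc_lt_16.1": 1})
p_none = predicted_probability({"height_ge_74.5cm": 0, "wbc_lt_16.1": 0})
print(round(p_both, 3), round(p_none, 3))
print(round(odds_ratio(BETAS["height_ge_74.5cm"]), 2))
```

A dynamic (web-based) nomogram performs the same calculation interactively, with the user supplying the predictor values.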
Guidelines and regulations statements. All

Data availability
The datasets used and/or analyzed during the current study are available from the corresponding author upon reasonable request.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.