Prediction of venous thromboembolism with machine learning techniques in young-middle-aged inpatients

Accumulating evidence suggests that the risk factors for venous thromboembolism (VTE) among young-middle-aged inpatients differ from those among elderly people. Therefore, the current prediction models for VTE are not applicable to young-middle-aged inpatients. The aim of this study was to develop and validate a new prediction model for young-middle-aged people using machine learning methods. Clinical data sets from 167 inpatients with deep venous thrombosis (DVT) and/or pulmonary embolism (PE) and 406 patients without DVT or PE were compared and analysed with machine learning techniques. Five algorithms (logistic regression, decision tree, feed-forward neural network, support vector machine, and random forest) were used to train the models. The support vector machine model had the best performance, with an AUC 95% CI of 0.806–0.944, 59% sensitivity, 99% specificity, and 87% accuracy. Although the top predictors of adverse outcomes differed between models, life-threatening illness, fibrinogen, RBCs, and PT were the most consistently featured top predictors. Clinical data sets of young and middle-aged inpatients can be used to accurately predict the risk of VTE with a support vector machine model.

The selection of the best model. The results of the training and testing subsets are shown in Table 2.
The results of the training subsets agreed well with those of the testing subsets in the support vector machine (SVM) and feed-forward neural network (nnet) models. Slight underfitting appeared in the generalized linear method (GLM) and decision tree (RPART) models, while overfitting appeared in the random forest (RF) model. The cross-validated area under the receiver operating characteristic (ROC) curve (cvAUC) generated with the different models, with estimated 95% confidence intervals in the testing subsets, is shown in Fig. 3A, and the consensus ROC curves in the testing subsets are shown in Fig. 3B. Representative confusion matrices are shown in Table 3. All methods except the decision tree (RPART) yielded very similar consensus ROC areas. The SVM model achieved stable and good performance under both evaluation methods, with an AUC 95% CI of 0.806–0.944, 59% sensitivity, 99% specificity, and 87% accuracy.
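The sensitivity, specificity and accuracy reported above follow directly from the four cells of a confusion matrix, using the definitions given for Table 3 (sensitivity = true pos/[true pos + false neg], specificity = true neg/[true neg + false pos]). The following is a minimal Python sketch of that arithmetic; the study itself used R, and the class name `ConfusionMatrix` and the counts are hypothetical illustrations, not values from Table 3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionMatrix:
    """Cell counts for a binary classifier (positive class = VTE)."""
    tp: int  # VTE patients predicted as VTE
    fn: int  # VTE patients predicted as non-VTE
    tn: int  # non-VTE patients predicted as non-VTE
    fp: int  # non-VTE patients predicted as VTE

    def sensitivity(self) -> float:
        # Share of true VTE cases that the model detects.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Share of non-VTE cases that the model correctly excludes.
        return self.tn / (self.tn + self.fp)

    def accuracy(self) -> float:
        # Share of all cases assigned to the correct class.
        total = self.tp + self.fn + self.tn + self.fp
        return (self.tp + self.tn) / total


# Hypothetical counts for illustration only; NOT taken from Table 3.
example = ConfusionMatrix(tp=30, fn=20, tn=95, fp=5)
print(f"sens={example.sensitivity():.2f} "
      f"spec={example.specificity():.2f} "
      f"acc={example.accuracy():.2f}")
```

A model can combine high specificity with modest sensitivity, as the SVM model does here, when it rarely raises false alarms but misses a substantial share of true cases; accuracy alone does not reveal that trade-off when the classes are imbalanced.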
Variable rankings of the models. The top 4 variables in each model are shown in Table 4. Life-threatening illness, fibrinogen, RBCs, and PT appeared to be more consistently featured by the different models (≥ 3) as top predictors of adverse outcomes. In addition, these factors were considered strong predictors by both the SVM and RPART models. In particular, life-threatening illness and fibrinogen were consistently chosen as the top predictors of adverse outcomes by 4 models, and PT was selected by the SVM, RPART and RF models as having the highest importance. In addition, SHapley Additive exPlanation (SHAP) values of each feature within the SVM model are shown in Fig. 4.

Discussion
Machine learning may reduce the workload of clinicians, change diagnostic procedures, and reduce medical costs 16 . In this study, we attempted to develop a preliminary machine learning model for predicting VTE in young and middle-aged hospitalized patients. The results showed that SVM was the most accurate algorithm to predict VTE with the highest average AUC and superior statistical performance. An analysis of variables with different models showed that life-threatening illness, fibrinogen, RBCs, and PT appeared to be consistently featured by the different models as top predictors of adverse outcomes.
The goal of machine learning algorithms is to search for a linear or nonlinear function for classification or prediction 17 . Logistic regression is a generalized linear regression model 18 . The nnet is a form of supervised machine learning in which the data to be learned are neither sequential nor time-dependent 19 . RF constructs decision trees on various subsamples of the dataset and averages their outputs to improve predictive accuracy and control overfitting 20 . SVM is a classification method that separates multidimensional data with a hyperplane 21 . The decision tree (RPART) uses a tree-like graph of decisions and their possible consequences to classify features, and is a graphical method that applies probability analysis to classification or regression tasks 22 . However, the performance of machine learning algorithms varies with different data sets, and no algorithm can achieve good performance in all possible learning problems 23 . In general, the AUC value ranges from 0.5 to 1.0, with values between 0.5 and 0.7 indicating low discrimination ability, values between 0.7 and 0.9 indicating moderate discrimination ability, and values > 0.9 indicating high discrimination ability 24 . In our study, the cross-validated areas under the ROC curve were calculated to assess the predictive power of the models by using the cvAUC function of the cvAUC library with tenfold cross-validation 25 . The cvAUC values of 0.810, 0.752 and 0.868 for the training sets of GLM, RPART, and nnet, respectively, indicate that these three models have moderate discrimination ability. The cvAUC values of 0.904 and 1 in the training sets of SVM and RF indicate that these two models have high discrimination ability. Generalization performance is a very important aspect of the application of machine learning algorithms. Overfitting leads to poor generalization of these models 26 .
Here, overfitting occurred in the RF model. Therefore, RF is not an appropriate model, although it had the best performance on the training set. Among the other four models, SVM achieved the best performance (the highest cvAUC value). The confusion matrix is another widely used method to evaluate classification results [27][28][29] . The confusion matrix analysis found that the GLM and SVM models had the best and second-best performance (in terms of accuracy, sensitivity, and specificity), respectively. Based on the results of the two evaluation methods, SVM may be the most stable and accurate method for predicting the risk of VTE in young and middle-aged hospitalized patients. In our study, the top predictors of adverse outcomes consistently featured by the different models were life-threatening illness, fibrinogen, RBCs, and PT (each of which appeared in at least 3 models as a top predictor). Haemoglobin, prophylactic treatment, digestive tract ulcer, CVC or PICC insertion, history of DVT, and history of PE were each featured by one of these methods. Some of these risk factors have been confirmed to be related to VTE. For example, CVC or PICC insertion and a history of DVT and PE have been extensively investigated as high-risk factors for VTE 30,31 . In addition, life-threatening illness and fibrinogen have been confirmed by a recent meta-analysis to be related to the risk of VTE, and these factors are mainly related to the occurrence of VTE in the elderly 32 . However, the relationship of other factors, such as PT, haemoglobin and RBCs, with VTE has seldom been studied, and these factors are usually not associated with VTE in the elderly 32 . Recent studies confirmed that PT was an independent risk factor for VTE in the perioperative period in patients with prostatic tumours and for COVID-19-related thrombotic complications 33,34 . In addition, the relationship between red blood cells and VTE has been increasingly recognized 35,36 .
Even more interesting is that haemoglobin has been reported to be associated with VTE risk in cancer patients 37 . However, there is no significant relationship between haemoglobin and VTE in elderly diabetic patients 38 . Our   40 . In the Kucher model and the Padua prediction score, elderly age was considered a high risk factor for VTE. In the Caprini model, age was subdivided into 40-60, 61-74 and 75+ 30 . A previous study showed that the incidence of VTE strongly increases with age, which may be explained by the biology of ageing rather than by exposure to an increased number of VTE risk factors 14 . To date, we have not found any research evaluating the performance of these models in young and middle-aged hospitalized patients. In addition, the performance of the PESI score and Wells score models in predicting PE in young and middle-aged patients is poor 11,41 . Given the different risk factors faced by patients of different ages, it is necessary to evaluate patients of different ages separately. Therefore, the prediction model in our study will contribute to the prevention and management of VTE in young-middle-aged patients.

Figure 3. (A) Cross-validated area under the ROC curve (cvAUC) generated with different models with estimated 95% confidence intervals. (B) Consensus ROC curves generated with different models. Yellow is the generalized linear model, black the support vector machine, red the decision tree, green the neural network, and blue the random forest model. GLM generalized linear method, SVM support vector machine, nnet feed-forward neural network, RF random forest.

Table 3. Confusion matrices in different models. Results from analysis performed with the whole testing set. Sens refers to sensitivity at detecting a composite outcome (true pos/[true pos + false neg]). Spec refers to specificity at excluding a composite outcome (true neg/[true neg + false pos]), and acc refers to the accuracy of the assignment. GLM generalized linear method, SVM support vector machine, nnet feed-forward neural network, RF random forest, neg negative, pos positive.

Strengths and limitations.
The main strength of our study was that our clinical data covered various diseases in young and middle-aged people. Additionally, the study compared and analysed the performance of five machine learning techniques for VTE. This comparison and analysis enabled a comprehensive understanding of the risk factors for VTE in young and middle-aged people and increased our confidence in our conclusions. This study has several limitations. First, we developed the VTE model using clinical data, mainly including biochemical indicators, but did not consider other factors, such as environmental factors and genetic factors (VTE-associated genes). In addition, because this was a retrospective study, the selection of VTE cases and controls might have introduced selection bias 42 . Second, most of the factors included in the study were dichotomous variables rather than continuous variables, so the relationship between the exposure levels of these risk factors and VTE was not considered, which may hide their true relationships with VTE. Third, the risk factors identified by the different machine learning techniques differed, which complicates interpretation. Further study should determine the predictive value of these risk factors for VTE in young-middle-aged inpatients. Fourth, it was not possible to conduct external validation of these models due to the lack of available independent datasets at this time, so the generalization abilities of the models for other populations are still unknown.

Conclusions
This is the first study using machine learning techniques to estimate the VTE risk for young-middle-aged inpatients. Our study confirmed that the new SVM model-predicted risk probability is helpful for care providers as it guides the management and prevention of high-risk young and middle-aged inpatients.

Methods
Study design and patients. The study was conducted using data for all patients hospitalized in all medical departments of China-Japan Union Hospital (Jilin University, Changchun, Jilin Province, China). Patients who were ≤ 45 years of age and had a ≥ 2-day duration of hospitalization were included. Patients who (i) had VTE on admission, (ii) were ≤ 18 years of age, (iii) were pregnant, (iv) lacked major indications and experimental data (more than 7 parameters were missing), or (v) had uncertainties in the acquisition time for laboratory indicators were excluded. Data for VTE and non-VTE patients were first collected between January 2017 and October 2018. Next, to address the class imbalance problem caused by the small number of patients with VTE 43 , VTE cases between January 2019 and December 2020 were also included.
Covariates. Data on comorbidities, physical findings, laboratory results and medication were retrieved from the medical records of the hospital. Thrombosis was only recorded during hospitalization.
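The inclusion and exclusion criteria listed under Study design and patients can be read as a single eligibility rule. The following Python sketch shows one way to encode it; the study does not describe its extraction code, so the `Admission` record, its field names and the example cohort are hypothetical, and only the thresholds (age, length of stay, number of missing parameters) come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Admission:
    """Hypothetical per-patient record; field names are illustrative."""
    age_years: int
    stay_days: int
    vte_on_admission: bool
    pregnant: bool
    missing_parameters: int
    lab_time_uncertain: bool


MAX_AGE_YEARS = 45           # included if age <= 45 years
EXCLUDED_AGE_YEARS = 18      # excluded if age <= 18 years
MIN_STAY_DAYS = 2            # included if hospitalized for >= 2 days
MAX_MISSING_PARAMETERS = 7   # excluded if more than 7 parameters missing


def is_eligible(a: Admission) -> bool:
    """Apply the inclusion criteria, then exclusion criteria (i) to (v)."""
    included = a.age_years <= MAX_AGE_YEARS and a.stay_days >= MIN_STAY_DAYS
    excluded = (
        a.vte_on_admission                                  # (i)
        or a.age_years <= EXCLUDED_AGE_YEARS                # (ii)
        or a.pregnant                                       # (iii)
        or a.missing_parameters > MAX_MISSING_PARAMETERS    # (iv)
        or a.lab_time_uncertain                             # (v)
    )
    return included and not excluded


# Made-up cohort for illustration only.
cohort = [
    Admission(30, 5, False, False, 2, False),  # eligible
    Admission(18, 5, False, False, 0, False),  # excluded: age <= 18
    Admission(40, 1, False, False, 0, False),  # not included: stay < 2 days
    Admission(35, 3, False, False, 8, False),  # excluded: > 7 missing values
]
eligible = [a for a in cohort if is_eligible(a)]
print(len(eligible))  # 1
```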
Variables included the following: age (age ≤ 45 years), sex, hypertension, myocardial infarction, peripheral vascular diseases (vascular occlusion angeitides, Buerger disease, external jugular venous aneurysm, femoral arteriovenous fistula, popliteal artery injury, bilateral femoral artery injury, lower extremity artery injury, oesophageal and gastric varices, lower limb varicosity, lymphedema, hepatic haemangioma and intermuscular haemangioma), cerebrovascular disease (ischaemic vascular disease, haemorrhagic cerebrovascular disease and intracranial arteriovenous malformations), active inflammation (acute and chronic inflammation except for phlebitis and vasculitis), rheumatoid disease (rheumatoid arthritis, rheumatic heart disease and ankylosing spondylitis), immune system diseases (allergic dermatitis, purpura dermatosis, systemic lupus erythematosus), digestive tract ulcer, diabetes without complications, diabetes with complications (diabetic ketoacidosis, diabetic peripheral neuropathy and diabetic ketoaciduria), renal disease, hemi- or paraplegia, mild liver disease (fatty liver, hepatic haemangioma, hepatic cyst, intrahepatic bile duct stone), moderate to severe liver disease (abnormal liver function, liver cirrhosis and hepatitis B), active cancer (admission for a cancer diagnosis or for chemotherapy), history of DVT (history of upper or lower-extremity DVT within 30 days), history of PE (within 30 days), history of any VTE event (except for DVT and/or PE), life-threatening illness (any condition requiring ICU admission or transfer during hospitalization), history of prior CVA/TIA (cerebrovascular accident, transient ischaemic attack), CVC or PICC insertion, surgery type, prophylactic treatment, haemostatic treatment, triglyceride, total cholesterol, activated partial thromboplastin time (APTT), prothrombin time (PT), fibrinogen, white blood cell count (WBC), red blood cell count (RBC), haemoglobin, platelet count, and C-reactive protein (CRP).
For nonsurgical inpatients, the first laboratory index after admission was used. For hospitalized patients who underwent surgery, the laboratory index was the first laboratory examination index after the first surgery. Patients with VTE occurring before surgery were treated as nonsurgical patients. The data for variables before VTE onset were used. For categorical variables, if there was corresponding information in the medical record, they were assigned according to that information; if there was no corresponding information, they were considered normal.
 44 . The data were cleaned with the manyNAs function in the DMwR package 45 . The missing continuous data were imputed with the knnImputation function in the DMwR package with a k value of 10. Then, the subjects were randomly assigned at a ratio of 75:25 with the createDataPartition function in the CARET package 46 into a training set (n = 431) for variable determination and model construction and a test set (n = 142) to test the model performance. The details of the variables are shown in Box 1.
Five algorithms (logistic regression, decision tree, feed-forward neural network, support vector machine, and random forest) were used to train the models. The generalized linear method (logistic regression) model used the glm function in the stats package 44 . A univariate logistic regression analysis was performed initially to identify significant variables (features). All variables that were significant at the 5% level in the univariate analysis were entered into the multiple logistic regression model, and stepwise elimination was used to determine the final variables. The decision tree, feed-forward neural network, support vector machine, and random forest models used the rpart, nnet, svmRadial, and rf methods in the CARET package, respectively.
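The 75:25 split described above can be illustrated with a stratified partition that samples within each outcome class. This is a minimal Python sketch, not the CARET implementation used in the study; the function name `stratified_split` and the seed are hypothetical. Rounding each class's training share up is an assumption, chosen because it reproduces the reported set sizes (431 and 142) from the 167 VTE and 406 non-VTE patients.

```python
import math
import random


def stratified_split(labels, train_fraction=0.75, seed=0):
    """Split indices into train/test while preserving class proportions.

    Assumption: the training share of each class is rounded up.
    """
    rng = random.Random(seed)
    by_class = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)

    train, test = [], []
    for indices in by_class.values():
        shuffled = indices[:]
        rng.shuffle(shuffled)
        n_train = math.ceil(len(shuffled) * train_fraction)
        train.extend(shuffled[:n_train])
        test.extend(shuffled[n_train:])
    return sorted(train), sorted(test)


# 1 = VTE (167 patients), 0 = non-VTE (406 patients), as reported in the study.
labels = [1] * 167 + [0] * 406
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))  # 431 142
```

Sampling within each class keeps the VTE proportion nearly identical in both sets, which matters when the positive class is the minority.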
The recursive feature elimination method in the CARET package was used to identify the combination of optimal features for each machine learning model 47,48 . Tenfold cross-validation was used to minimize overfitting and feature selection bias in the models [49][50][51] . To obtain the best performance of the models, the parameter cp was tuned for rpart, size and decay for nnet, sigma and C for svmRadial, and mtry for rf. The RF model was constructed by the rf method with mtry = 2, and the nnet model was constructed by the nnet method with size = 1 and decay = 1e−04. The varImp function of the CARET package was used to calculate the importance of variables in each model, and the four variables with the highest scores were considered the top variables of the model. The full feature importance graph of the SVM model was constructed by using Scott M. Lundberg's SHAP method 52 .
Model comparisons. For model evaluation and validation, the cross-validated area under the receiver operating characteristic (ROC) curve (cvAUC) was determined with 10 folds of the testing sets created by the createFolds function in the CARET package, using the method of LeDell et al. 53 . The ROC curve threshold in the calculation process was the default value of the cvAUC function in the cvAUC package 53 . The consensus ROC curve for each model was generated by using the cvAUC function in the cvAUC package. The confusion matrices of each model in the testing sets were also used to evaluate the accuracy of the models.
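The cross-validated AUC used for model comparison can be sketched as the mean of the AUC values computed within each fold. The Python code below is an illustrative sketch rather than the cvAUC package itself: the function names, the fold assignment scheme and the example data are hypothetical, and it does not compute the confidence interval. The `discrimination` helper encodes the AUC bands cited in the Discussion; assigning exactly 0.7 to the moderate band is an assumption, since the text leaves that boundary open.

```python
import random
from statistics import mean


def auc(scores, labels):
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def make_folds(labels, k=10, seed=0):
    """Deal the shuffled indices of each class across k folds, so every
    fold holds both classes when each class has at least k members."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for position, i in enumerate(idx):
            folds[position % k].append(i)
    return folds


def cv_auc(scores, labels, k=10, seed=0):
    """Mean of the fold-wise AUC values."""
    folds = make_folds(labels, k, seed)
    return mean(auc([scores[i] for i in fold], [labels[i] for i in fold])
                for fold in folds)


def discrimination(auc_value):
    """Map an AUC to the bands cited in the Discussion."""
    if auc_value > 0.9:
        return "high"
    if auc_value >= 0.7:  # assumption: 0.7 itself counts as moderate
        return "moderate"
    return "low"


# Made-up, perfectly separated scores: every fold has AUC 1.0.
demo_labels = [1] * 20 + [0] * 30
demo_scores = [float(y) for y in demo_labels]
print(cv_auc(demo_scores, demo_labels), discrimination(0.868))
```

Averaging fold-wise AUCs gives a less optimistic estimate than a single AUC on the data used for fitting, which is why a training-set value of 1, as seen for RF, is read as a sign of overfitting rather than of perfect discrimination.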

Ascertainment of outcomes.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.