Introduction

Hematoma expansion occurs in one third of patients with acute intracerebral hemorrhage (ICH), and has been identified as a factor associated with early neurologic deterioration and poor outcome1,2,3,4,5. Therefore, its accurate prediction on admission assists in developing appropriate patient management strategies. Various predictive factors for hematoma expansion have been suggested, including time from onset to baseline imaging, older age, antiplatelet use, anticoagulant use, ICH volume on baseline imaging, and CT markers such as intrahematoma hypodensities, irregular hematoma shape, blend sign, and CT angiography spot sign3,4,5,6,7,8,9,10,11,12,13,14,15. Additionally, several predictive scores that combine those factors have been reported16,17,18,19,20.

Machine learning (ML) approaches have been used in clinical studies, and perform well in disease detection, outcome prediction, and classification of various medical data21,22,23,24. To apply the study results using ML to clinical practice, there are some important points to be considered. The Radiological Society of North America developed a list of key considerations of ML research: it emphasized the generalizability of the research work and the reproducibility of the work’s results25,26. However, many clinical studies using ML lack those perspectives: for example, single-vendor images are used in imaging analysis, and ML algorithms are not publicly available25,26.

To develop accurate, generalizable and widely applicable predictive models of hematoma expansion in acute ICH, we applied ML algorithms to clinical data and CT findings on admission. Multicenter data and multivendor CT images were used, and the algorithms were made available via a website.

Materials and methods

Study population

Consecutive patients with acute ICH who were admitted to Mie Chuo Medical Center between December 2012 and July 2020, Matsusaka Chuo General Hospital between January 2018 and December 2019, Suzuka Kaisei Hospital between October 2017 and October 2019, and Mie University Hospital between January 2017 and July 2020 were retrospectively reviewed. Patients in Mie Chuo Medical Center, Matsusaka Chuo General Hospital, and Suzuka Kaisei Hospital were assigned to the development cohort, and those in Mie University Hospital were assigned to the validation cohort.

Inclusion criteria were defined as follows: ≥ 18 years of age; baseline CT scan within 24 h of onset; and follow-up CT scan within 30 h after baseline CT scan. Exclusion criteria were defined as follows: traumatic ICH; secondary cause of ICH (e.g., aneurysm, arteriovenous malformation, arteriovenous fistula, hemorrhagic transformation of infarction, and tumor); and surgical evacuation before follow-up CT scan.

Baseline clinical variables included age, sex, medical history (ICH, cerebral infarction, ischemic heart disease, hypertension, diabetes mellitus, and dyslipidemia), anticoagulant use, antiplatelet use, Glasgow Coma Scale, systolic and diastolic blood pressures, prothrombin time-international normalized ratio (PT-INR), white blood cell count, hemoglobin, platelet count, serum creatinine, serum total bilirubin, and time from onset to baseline CT scan.

This study was approved by the following institutional review boards: Mie Chuo Medical Center institutional review board [permit number: MCERB-201926], Matsusaka Chuo General Hospital institutional review board [permit number: 232], Suzuka Kaisei Hospital institutional review board [permit number: 2020–05], and Mie University Hospital institutional review board [permit number: T2019-19]. Because this was a retrospective study, the requirement for separate informed patient consent was waived by the same institutional review boards. All study protocols and procedures were conducted in accordance with the Declaration of Helsinki. This manuscript was prepared according to the standards for reporting of diagnostic accuracy (STARD) statement.

Imaging analysis

CT scans were performed using 120 kVp with a thickness of 0.5–10.0 mm in the supine position. CT angiography was performed by injecting 50–100 ml of an iodinated contrast material at 3.5–5.0 ml/s; but not all patients underwent CT angiography. Manufacturers and models of CT scanners in the development cohort included Aquilion ONE (Canon Medical Systems, Ohtawara, Japan), Aquilion 64 (Canon Medical Systems), LightSpeed Plus (GE Medical Systems, Milwaukee, WI, USA), LightSpeed VCT (GE Medical Systems), BrightSpeed Elite (GE Medical Systems), and SOMATOM Definition Flash (SIEMENS Healthineers, Erlangen, Germany), and those in the validation cohort included Aquilion 64 and Discovery CT750 HD (GE Medical Systems).

The hemorrhage locations were categorized as basal ganglia, thalamus, lobe, brain stem, and cerebellum. The presence of intraventricular extension of hemorrhage was noted. The hematoma volume was calculated with the ABC/2 formula27. Hematoma expansion was defined as an increase in volume between baseline and follow-up CT scans exceeding 6 cm³ or 33% of the baseline volume16,17,18,19,20,28.
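The volume estimate and the expansion criterion above can be sketched in a few lines of Python. This is an illustrative sketch, not the study's code: the function names are hypothetical, the meaning of A, B, and C follows the usual ABC/2 convention (greatest diameter, perpendicular diameter on the same slice, and vertical extent), which the text does not spell out, and the diameters in the example are invented.

```python
def abc2_volume(a_cm, b_cm, c_cm):
    """Approximate hematoma volume in cm^3 with the ABC/2 formula.

    Assumed convention (not stated in the text):
    a_cm = greatest hematoma diameter on the slice with the largest hematoma,
    b_cm = diameter perpendicular to A on the same slice,
    c_cm = vertical extent of the hematoma.
    """
    return a_cm * b_cm * c_cm / 2.0


def is_hematoma_expansion(baseline_cm3, followup_cm3):
    """Growth exceeding 6 cm^3 or exceeding 33% of the baseline volume."""
    growth = followup_cm3 - baseline_cm3
    return growth > 6.0 or growth > 0.33 * baseline_cm3


# Invented diameters, for illustration only.
baseline = abc2_volume(4.0, 3.0, 2.5)   # 15.0 cm^3
followup = abc2_volume(5.0, 3.5, 2.5)   # 21.875 cm^3
print(is_hematoma_expansion(baseline, followup))  # True: growth is 6.875 cm^3
```

Because the two criteria are joined by "or", a small hematoma can qualify through the relative threshold alone: growth from 6.0 to 8.5 cm³ is only 2.5 cm³ but exceeds 33% of the baseline.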

Intrahematoma hypodensities, irregular hematoma shape, and blend sign were identified as noncontrast CT markers. Intrahematoma hypodensities were defined as the presence of any hypodense region encapsulated within the hematoma having any morphology and size, separated from the surrounding parenchyma3,4,12,14. Irregular hematoma shape was defined as the presence of 2 or more hematoma edge irregularities4,7,9,12. Blend sign was defined as blending of a relatively hypoattenuating area with an adjacent hyperattenuating region within a hematoma, with a well-defined margin and a difference of at least 18 Hounsfield units between these regions4,6,8,12. When available, CT angiography spot sign was evaluated, which was defined as follows: (1) ≥ 1 focus (attenuation ≥ 120 Hounsfield units) of any size and morphology of contrast pooling within a hematoma, and (2) discontinuous from normal or abnormal vasculature adjacent to the hematoma15,29. The CT markers were independently evaluated by 2 observers. When the observers disagreed, the CT images were re-evaluated by both observers together until consensus was reached.

Inhospital management

After identification of ICH on baseline CT scan, continuous blood pressure monitoring and blood pressure-lowering treatment were initiated. Calcium channel blockers, mainly intravenous nicardipine, were administered as antihypertensive agents throughout the period between baseline and follow-up CT scans. The target systolic blood pressure was set at either less than 140 mmHg or less than 180 mmHg.

Statistical analysis

Continuous variables were summarized using a mean with standard deviation or a median with interquartile range and compared using Student’s t test or Mann–Whitney U test, depending on the distribution of the variable assessed by the Shapiro–Wilk test. Categorical variables were summarized using a count with percentages and compared using Fisher’s exact test.

To assess whether the ML predictive models outperformed the previous scoring methods, the BAT, BRAIN, and 9-point scores in the validation cohort were calculated16,17,18,19. The receiver operating characteristic (ROC) curve was drawn, and the best cutoff value was determined by Youden’s index. For each scoring method, accuracy, sensitivity, specificity, and the area under the ROC curve (AUC) for the prediction of hematoma expansion were computed. The AUCs of the three scores and those of the ML models were compared using the DeLong test.
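The cutoff selection can be illustrated with a short Python sketch. Youden's index is J = sensitivity + specificity − 1, and the best cutoff is the one that maximizes J. The study performed this step in EZR, so the code below is only a stand-in: the function name is hypothetical, the rule "predict expansion when score ≥ cutoff" is an assumption, and the scores and labels are invented.

```python
def youden_cutoff(scores, labels):
    """Return (cutoff, sensitivity, specificity) that maximizes Youden's J.

    Assumption: a patient is predicted positive when score >= cutoff.
    labels: 1 = hematoma expansion, 0 = no expansion.
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    best = None
    for cutoff in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y == 1)
        tn = sum(1 for s, y in zip(scores, labels) if s < cutoff and y == 0)
        sens = tp / n_pos
        spec = tn / n_neg
        j = sens + spec - 1.0
        if best is None or j > best[0]:
            best = (j, cutoff, sens, spec)
    return best[1], best[2], best[3]


# Invented integer risk scores for 10 patients (not study data).
scores = [0, 1, 1, 2, 2, 3, 3, 4, 4, 5]
labels = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1]
cutoff, sens, spec = youden_cutoff(scores, labels)
print(cutoff, sens, round(spec, 3))
```

On this toy data the cutoff of 3 captures every positive case while misclassifying one of six negatives, which gives the largest J among the candidate cutoffs.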

All statistical analyses were performed using EZR (Saitama Medical Center, Jichi Medical University, Saitama, Japan)30, which is a graphical user interface for R (The R Foundation for Statistical Computing, Vienna, Austria).

Machine learning environment and algorithms

The programming language Python (version 3.7.8) and its libraries, NumPy (version 1.19.1), scikit-learn (version 0.23.2), XGBoost (version 1.2.0), imbalanced-learn (version 0.7.0), and matplotlib (version 3.3.1), were used for all data processing. The programming code was executed in Jupyter Notebook (version 6.0.3).

To develop predictive models, supervised ML algorithms were adopted, in which pairs of the input data and the output class were given to the algorithm, which found a way to generate the output class from the input data31. The k-nearest neighbors (k-NN) algorithm, logistic regression, support vector machines (SVMs), random forests, and XGBoost were selected as the supervised algorithms. The k-NN algorithm is the simplest ML algorithm, which finds k neighbors closest to a new observation in the stored training data and makes a prediction by assigning the majority class among these neighbors31. Logistic regression is a binary classifier, in which a linear model is included in a logistic function and the probability that a new observation is a member of each class is computed31. SVMs find the hyperplane that maximizes the margin between classes in the training data, making a prediction based on the distances to the support vectors and the importance of support vectors31. Random forests train many decision trees, where each tree only receives a bootstrapped sample of the training data and each node only considers a subset of features when determining the best split, making a prediction in accordance with the averaged probabilities predicted by all the trees31. XGBoost is a gradient boosting algorithm, which works by building decision trees in a serial manner, where each tree tries to correct the mistakes of the previous one; and the probability is computed by summing the weight of the leaves to which a new observation belongs in each decision tree31. With each supervised algorithm, predictive model development using the patient data of the development cohort (training data set) and external validation using that of the validation cohort (test data set) were planned.
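As an illustration of the simplest of these algorithms, the following Python sketch implements the k-NN majority vote described above without any library. It is not the study's implementation, which used scikit-learn; the function name, the Euclidean distance, and the two-feature toy data are assumptions made for the example.

```python
from collections import Counter


def knn_predict(train_X, train_y, x_new, k=5):
    """Predict the class of x_new by majority vote among its k nearest neighbors.

    Squared Euclidean distance is used; it gives the same neighbor ordering
    as the Euclidean distance itself.
    """
    distances = sorted(
        (sum((a - b) ** 2 for a, b in zip(x, x_new)), y)
        for x, y in zip(train_X, train_y)
    )
    votes = Counter(y for _, y in distances[:k])
    return votes.most_common(1)[0][0]


# Invented training data: two features per patient, 1 = expansion, 0 = none.
train_X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (1.0, 1.0), (0.9, 1.1), (1.1, 0.9), (1.2, 1.0)]
train_y = [0, 0, 0, 1, 1, 1, 1]

print(knn_predict(train_X, train_y, (1.0, 0.95), k=5))  # 1
print(knn_predict(train_X, train_y, (0.1, 0.1), k=3))   # 0
```

Because the prediction depends directly on distances, features measured on very different scales can dominate the vote, so the choice of k and the treatment of feature scales both affect the result.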

Feature selection and scaling, and oversampling

Baseline clinical variables, CT findings including hemorrhage locations, intraventricular hematoma extension, baseline hematoma volume, and noncontrast CT markers, and target systolic blood pressure were applied as the input data, while hematoma expansion was applied as the output class.

Since there were 31 individual properties of the input data, which were called features, feature selection was performed to lead to simpler models that generalize better31. Firstly, univariate analyses with Student’s t test, Mann–Whitney U test, and Fisher’s exact test were performed between expansion and no expansion groups in the training data set. Secondly, the features were ranked in accordance with their P values. Finally, 5 to 10 features with the smallest P values were selected. Feature scaling was performed using standardization in SVMs, which required all the features to vary on a similar scale to perform well.
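The ranking step and the standardization step can be sketched as follows. This is a minimal illustration with hypothetical function names, feature names, and P values; it assumes the univariate P values have already been computed, and it uses the population standard deviation for the z-scores, which is an assumption about the scaling convention rather than something stated in the text.

```python
from statistics import mean, pstdev


def select_features(p_values, k):
    """Return the k feature names with the smallest P values, in increasing order."""
    return sorted(p_values, key=p_values.get)[:k]


def standardize(column):
    """Rescale one feature column to zero mean and unit variance (z-scores)."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation (assumed convention)
    return [(v - mu) / sigma for v in column]


# Invented P values from univariate tests (not the values in Table 3).
p_values = {"baseline_volume": 0.001, "sex": 0.04, "age": 0.60, "pt_inr": 0.005}
print(select_features(p_values, 2))   # ['baseline_volume', 'pt_inr']

z = standardize([2.0, 4.0, 6.0])
print([round(v, 3) for v in z])       # [-1.225, 0.0, 1.225]
```

In practice the mean and standard deviation would be estimated on the training data set and then reused to transform the test data set, so that no information from the test data enters model development.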

Given the imbalance of the output class distribution, random oversampling was employed. Random oversampling involved randomly selecting observations from the minority group with replacement and adding them to the training data set.
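Random oversampling as described above can be sketched in plain Python. The study used the imbalanced-learn library for this step; the sketch below is a stand-in with a hypothetical function name, and it assumes that minority observations are added until both classes have the same size, which the text does not state explicitly.

```python
import random


def random_oversample(X, y, seed=0):
    """Add randomly chosen minority-class rows (with replacement) until classes are balanced.

    Assumes binary labels (0/1) and at least one observation in each class.
    """
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    # rng.choices samples with replacement, so a row may be duplicated several times.
    extra = rng.choices(minority, k=len(majority) - len(minority))
    idx = list(range(len(y))) + extra
    return [X[i] for i in idx], [y[i] for i in idx]


# Invented training set: 8 patients without expansion, 2 with expansion.
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2
X_res, y_res = random_oversample(X, y)
print(len(y_res), sum(y_res))  # 16 8
```

The original observations are all kept, and only the training data set is resampled; the test data set retains its natural class distribution.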

Predictive model development and external validation

Each supervised ML algorithm was applied to the training data set with 5 to 10 selected features and all 31 features. In the predictive model development process, stratified 30-fold cross-validation was used to assess generalization performance, in which the training data set was split such that the proportions between output classes were the same in each fold as they were in the whole training data set31. The hyperparameters were tuned manually in each algorithm as shown in Table 1 to improve generalization performance, while the other hyperparameters not listed in Table 1 were used as default.
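The stratified split can be illustrated with a short Python sketch that deals the observations of each class into folds in turn, so that every fold keeps roughly the class proportions of the whole training data set. The function name and toy labels are hypothetical, the number of folds is a parameter, and the example uses 5 folds only because the toy data set is too small for a larger number; scikit-learn provides an equivalent utility, `StratifiedKFold`.

```python
import random


def stratified_folds(y, n_splits, seed=0):
    """Assign each sample index to one of n_splits folds, class by class."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_splits)]
    for label in sorted(set(y)):
        members = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % n_splits].append(i)  # deal out round-robin
    return folds


# Invented labels: 20 patients with expansion and 80 without.
y = [1] * 20 + [0] * 80
folds = stratified_folds(y, n_splits=5)
for fold in folds:
    print(len(fold), sum(y[i] for i in fold))  # each fold: 20 patients, 4 with expansion
```

In cross-validation, each fold serves once as the held-out set while the model is fitted on the remaining folds, and the scores are averaged across folds to compare hyperparameter settings.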

Table 1 Manually tuned hyperparameters and their values in each machine learning algorithm.

After the model development, each model was evaluated for its performance on the test data set as external validation, where accuracy, sensitivity, specificity, and the AUC for the prediction of hematoma expansion were computed.
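For readers unfamiliar with the AUC, the sketch below computes it directly from its rank interpretation: the probability that a randomly chosen patient with expansion receives a higher predicted score than a randomly chosen patient without expansion, with ties counted as one half. The function name and the predicted probabilities are invented for illustration; this is not the code used in the study.

```python
def auc_from_scores(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Invented predicted probabilities for 6 test patients.
probs = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
labels = [1, 1, 1, 0, 0, 0]
print(round(auc_from_scores(probs, labels), 3))  # 0.889
```

Here 8 of the 9 positive-negative pairs are ordered correctly, so the AUC is 8/9. Unlike accuracy, sensitivity, and specificity, this value does not depend on any single classification cutoff.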

Results

After application of the inclusion and exclusion criteria, 351 of 930 patients in the development cohort and 71 of 212 patients in the validation cohort were evaluated (Fig. 1). Hematoma expansion occurred in 71 patients (20.2%) in the development cohort and in 26 patients (36.6%) in the validation cohort (Table 2).

Figure 1

A flow chart indicating the included and excluded patients in the development cohort (a) and the validation cohort (b). ICH indicates intracerebral hemorrhage.

Table 2 Characteristics of the development and validation cohorts.

On comparison between expansion and no expansion groups in the development cohort, 10 variables with the smallest P values were baseline hematoma volume, intrahematoma hypodensities, PT-INR, anticoagulant use, lobar hemorrhage, irregular hematoma shape, platelet count, sex, time from onset to baseline CT scan, and cerebellar hemorrhage in increasing order (Table 3): these were used as selected features.

Table 3 Univariate analyses between expansion and no expansion groups in the development cohort.

The k-NN algorithm achieved the highest AUC of 0.790 (95% confidence interval [CI], 0.693–0.886) among all ML models, where 9 selected features were used and the hyperparameter n_neighbors was 5 (Table 4). Logistic regression yielded the AUC of 0.674 (95% CI, 0.563–0.784) when 6 selected features were used, and C was 0.1. SVMs yielded the AUC of 0.740 (95% CI, 0.634–0.846) when all 31 features were used, and C and gamma were 1 and 0.01, respectively. Random forests yielded the AUC of 0.741 (95% CI, 0.633–0.849) when all 31 features were used, and n_estimators and max_depth were 125 and 3, respectively. XGBoost yielded the AUC of 0.732 (95% CI, 0.623–0.841) when 9 selected features were used, and num_round, eta, max_depth, min_child_weight, colsample_bytree, subsample, gamma, and alpha were 20, 0.1, 4, 4, 0.9, 0.8, 0.1, and 0, respectively.

Table 4 Test characteristics of previously reported scoring methods and machine learning models in the validation cohort.

The best cutoff values in the previous scoring methods were 3 in the BAT score, 9 in the BRAIN score, and 4 in the 9-point score. Although the BRAIN score achieved the highest AUC of 0.676 (95% CI, 0.579–0.772) among all previous scoring methods, the k-NN algorithm that achieved the best performance of all ML models showed higher AUC than the BRAIN score (0.790 vs. 0.676; p = 0.016) (Table 4).

Discussion

We developed and validated ML predictive models of hematoma expansion in acute ICH. The models demonstrated good predictive ability, showing better performance than the previous scoring methods. Multicenter data and multivendor CT images were used for model development, so that the models were generalizable and widely applicable.

Thirty-one features, consisting of baseline clinical variables, CT findings, and target systolic blood pressure, were put into the model development process. Clinical variables only contained general patient information and blood test findings. Thus, they could be easily collected in clinical practice. All CT findings were obtained from noncontrast CT scans; and CT scan data included those performed with a thickness of 0.5–10.0 mm. Although the spot sign, which is also included in the 9-point score, is useful for predicting hematoma expansion, CT angiography is available in a limited number of hospitals. Additionally, although noncontrast CT markers are usually evaluated with a thickness of 5.0 mm, in clinics or developing countries, CT scans are not uncommonly performed with a thickness of more than 5 mm. Therefore, in order that predictive models could be used in many hospitals and countries, we acquired and analyzed CT scan data for such conditions. We experimentally included target systolic blood pressure in the features because it could be determined at admission. However, there was no statistical difference regarding target systolic blood pressure between expansion and no expansion groups in the development cohort. Therefore, target systolic blood pressure was not included in the features of the best ML model.

Feature selection was performed to develop simpler ML predictive models. When developing models using many features, or a high-dimensional data set, models become complex and the chance of overfitting increases31. There are three basic strategies for selecting features: model-based selection, iterative selection, and univariate analysis31. Model-based selection utilizes supervised ML models such as linear models and decision tree-based models to judge the importance of each feature. In iterative selection, a series of models for feature selection are built, where the features with higher importance are selected. These two methods consider all features at once and may be able to capture interactions between features. However, when the performance of the models for feature selection is low, selected features could be unreliable. We chose univariate analysis in this study; because it ignores correlations between individual features, features that are informative only when combined with other features are discarded. The best ML model performed well with univariate analysis, although better feature selection methods may exist. One caveat is that elaborate feature selection may itself lead to overfitting and thereby reduce model performance.

We have made the raw data and the programming code of ML algorithms available on the websites to ensure reproducibility of the developed models: we believe that this is the most important point for the clinical studies using ML. There may be better ML approaches than what we have shown in this study, and better ML algorithms that can achieve higher performance may be created in the future. By using ML approaches, we can easily add the data of other facilities and develop more robust and reliable ML models. With the maturity of ML technology and its usage environment, it is becoming easier for clinicians to learn ML and apply it to clinical research. We hope that our data and algorithms will be widely used and applied to new analyses.

ML approaches have been used in medical research and often perform better than classical statistical models21,22. In this study, even though there were some statistical differences in patient characteristics between the development and validation cohorts (Table 2), the developed ML models showed better predictive ability than the previous scoring methods, such as the BAT, BRAIN, and 9-point scores, in the validation cohort16,17,18,19.

Several clinical studies have investigated the relationship between blood pressure lowering and either patient outcome or hematoma expansion, although no conclusion has been reached yet32,33,34,35. However, the ultra-early lowering of blood pressure may benefit patients with acute ICH36. Moreover, anticoagulant reversal may reduce hematoma expansion37. The developed ML models in this study may be useful, especially in the ultra-early phase or when anticoagulants are given, for selecting patients who require more careful treatment.

A few limitations should be noted. First, more patients in the development and validation cohorts are needed to achieve more robust quality and more satisfactory performance of ML predictive models. It is hard to determine the appropriate number of patients in ML analyses because it depends on the quality of the input data. However, efforts are required to increase the number of patients and to make sure that model performance has reached a plateau irrespective of an increase in the number of patients38. Second, CT findings were evaluated by humans. If we utilize an artificial neural network for analyzing CT scan data, we can create hybrid models that unify analyses of imaging data and clinical variables within an ML pipeline. The hybrid models are likely to achieve higher predictive performance. However, a serious problem is that brain image data usually contain face information, which cannot easily be shared.

In conclusion, we developed widely applicable predictive models of hematoma expansion in acute ICH by applying ML algorithms to clinical data and noncontrast CT findings. The models showed better performance than the previous scoring methods. We have made the raw data and the programming code available on the websites so that anyone can utilize and improve the models.