Machine learning and atherosclerotic cardiovascular disease risk prediction in a multi-ethnic population

The pooled cohort equations (PCE) predict atherosclerotic cardiovascular disease (ASCVD) risk in patients with characteristics within prespecified ranges and have uncertain performance among Asians or Hispanics. It is unknown whether machine learning (ML) models can improve ASCVD risk prediction across broader, more diverse real-world populations. We developed ML models for ASCVD risk prediction in a multi-ethnic patient population using an electronic health record (EHR) database from Northern California. Our cohort included patients aged 18 years or older with no prior CVD and not on statins at baseline (n = 262,923), stratified into PCE-eligible (n = 131,721) and PCE-ineligible patients based on missing or out-of-range variables. We trained ML models [logistic regression with L2 penalty and L1 lasso penalty, random forest, gradient boosting machine (GBM), extreme gradient boosting] and evaluated 5-year ASCVD risk prediction, both with and without incorporation of additional EHR variables, and in Asian and Hispanic subgroups. A total of 4309 patients had ASCVD events, of which 2077 occurred in PCE-ineligible patients. GBM performance in the full cohort, including PCE-ineligible patients (area under the receiver-operating characteristic curve (AUC) 0.835, 95% confidence interval (CI): 0.825–0.846), was significantly better than that of the PCE in the PCE-eligible cohort (AUC 0.775, 95% CI: 0.755–0.794). Among patients aged 40–79, GBM performed similarly before (AUC 0.784, 95% CI: 0.759–0.808) and after (AUC 0.790, 95% CI: 0.765–0.814) incorporating additional EHR data. Overall, the ML models achieved comparable or improved performance relative to the PCE while allowing risk discrimination in a larger group of patients, including PCE-ineligible patients. EHR-trained ML models may help bridge important gaps in ASCVD risk prediction.


Supplementary
*GBM models presented here were trained on all variables and used iterative imputation to fill in missing variables.
**The F1 score was weighted to account for class imbalance.

This note provides additional details regarding the machine learning training, cross-validation, and testing approach.
First, the full dataset was randomly split by patient into an 80% training/validation set and a 20% held-out test set. This 80/20 split is a standard convention, and because our dataset is large, we expect the 20% held-out test set to be a representative sample of the full dataset. This held-out test set was defined only once, at the beginning of the study.
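A minimal sketch of this split follows; `df` and its column names are illustrative stand-ins, not the study's actual dataset, and the outcome stratification is an assumption (the text specifies only a random split by patient).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in: one row per patient with a rare 5-year outcome.
df = pd.DataFrame({"patient_id": range(1000),
                   "age": 50,
                   "outcome_5yr": [0] * 980 + [1] * 20})

# One row per patient, so a row-level random split is a patient-level split.
# Stratifying on the outcome keeps the rare-event rate similar in both splits.
train_val_df, test_df = train_test_split(
    df, test_size=0.20, stratify=df["outcome_5yr"], random_state=42)
```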
The following steps were repeated twice, with appropriate modifications: first, using only the variables that are used as inputs to the PCE, and second, using the additional variables extracted from the EHR.
Next, we used 5-fold cross-validation to tune hyperparameters using all of the variables extracted from the EHR. For this first hyperparameter tuning phase, we used a grid-search approach with a restricted set of the hyperparameter values shown in Supplementary Table 9. Additional details on the roles of individual hyperparameters can be found at https://scikit-learn.org/0.21/. We fixed the cross-validation folds a priori, so each ML model was trained and validated on the same sets of data for each fold. For each algorithm (logistic regression, lasso, random forest, gradient boosting machine, extreme gradient boosting), the model with the highest average AUC across the 5 folds was selected from the hyperparameter grid search. For this phase, simple imputation was used to fill in missing variables.
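The sketch below illustrates this phase-one search with fixed folds, assuming scikit-learn-style APIs; the synthetic data and the small grid are illustrative, not the actual values from Supplementary Table 9.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 20))            # stand-in EHR feature matrix
y_train = (rng.random(1000) < 0.02).astype(int)  # ~2% 5-year event rate

# Seeding the splitter fixes the folds a priori, so every algorithm is
# trained and validated on the same data in each fold.
cv_folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),  # simple imputation in phase 1
    ("model", GradientBoostingClassifier()),
])
search = GridSearchCV(
    pipe,
    param_grid={"model__n_estimators": [100, 300], "model__max_depth": [2, 3]},
    scoring="roc_auc",                           # selects by mean AUC over the 5 folds
    cv=cv_folds,
)
search.fit(X_train, y_train)
print(search.best_params_, round(search.best_score_, 3))
```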
We then used the best RF, GBM, XGBoost, and LRLasso models for feature selection, as the logistic regression model had poor cross-validated performance. This step was not needed for the analysis that considered only the PCE variables. We defined a composite score by examining the default feature rankings from each model. For RF, GBM, and XGBoost, the mean decrease in impurity (MDI) was used to assess feature importance. Although the MDI has some shortcomings regarding variable types (binary, categorical, and continuous) and missingness levels, we were primarily concerned with removing binary variables. For the LRLasso model, a feature's importance was determined by multiplying the absolute value of the coefficient by the standard deviation of the variable. For each of the four models, the most important variable was assigned a rank of 1, the second most important variable was assigned "2", etc. The composite score for each variable was defined as the minimum rank across the four models. Any feature with a composite score greater than 100 was excluded from further analysis; a second round of model training was then done, composite scores were recalculated, and any feature with a composite score greater than 50 was excluded from further analysis.
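An illustrative sketch of this composite ranking follows; the `importances` dict (MDI for the tree models, |coefficient| × standard deviation for LRLasso) uses made-up values for four features.

```python
import numpy as np

def composite_scores(importances):
    """Composite score per feature = minimum (best) rank across the models."""
    all_ranks = []
    for imp in importances.values():
        imp = np.asarray(imp)
        order = np.argsort(-imp)                   # indices, most important first
        ranks = np.empty(len(imp), dtype=int)
        ranks[order] = np.arange(1, len(imp) + 1)  # most important -> rank 1
        all_ranks.append(ranks)
    return np.min(np.vstack(all_ranks), axis=0)

importances = {"rf":    [0.4, 0.1, 0.3, 0.2],
               "gbm":   [0.5, 0.2, 0.2, 0.1],
               "xgb":   [0.3, 0.3, 0.3, 0.1],
               "lasso": [1.2, 0.0, 0.8, 0.4]}
scores = composite_scores(importances)
# First pass keeps composite scores <= 100; the second pass, after
# retraining, tightens the cutoff to 50.
keep = scores <= 100
```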
With the restricted variable set, we then performed a broader hyperparameter grid search for each algorithm, using all of the parameter values in Supplementary Table 9, again using the pre-selected cross-validation folds within the training set. In addition to these hyperparameters, we also tested whether iterative imputation or mean-value imputation resulted in a higher cross-validated AUC. The imputation was done within the cross-validation folds; in other words, a separate imputer was trained on each of the training CV folds ("training" the mean-value imputer simply consisted of calculating the mean for each variable within that fold) and then run on the held-out CV fold. Boolean missing flag variables were added for each variable that had missing data.
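A sketch of this imputer comparison is below, assuming a recent scikit-learn in which imputers accept `add_indicator=True` (which appends the Boolean missing-flag columns described above); `X_train`, `y_train`, and `cv_folds` follow the earlier sketches.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.impute import IterativeImputer, SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

results = {}
for name, imputer in [
    ("mean", SimpleImputer(strategy="mean", add_indicator=True)),
    ("iterative", IterativeImputer(add_indicator=True)),
]:
    # With the imputer inside the pipeline, cross_val_score fits it on each
    # training fold only, then applies it, frozen, to that fold's held-out data.
    pipe = Pipeline([("impute", imputer),
                     ("model", GradientBoostingClassifier())])
    results[name] = cross_val_score(pipe, X_train, y_train,
                                    cv=cv_folds, scoring="roc_auc").mean()
best_imputer = max(results, key=results.get)
```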
Once the best imputer was chosen and the new hyperparameters were set for each model, oversampling of the minority class (patients who developed ASCVD in the 5-year follow-up period) was tested at different rates. This consisted of randomly resampling individuals in the minority class, with replacement, until a desired number of observations with the outcome was included in the dataset. The rates tested were 2×, 5×, and 10× the original minority class size.
Oversampling was done only on the training folds, not on the evaluation folds, so that the cross-validated AUCs are comparable across all oversampling rates.
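The following sketch illustrates oversampling applied to the training fold only, leaving the validation folds untouched so that fold AUCs remain comparable across rates; `X_train`, `y_train`, `cv_folds`, and `pipe` follow the earlier sketches.

```python
import numpy as np
from sklearn.base import clone
from sklearn.metrics import roc_auc_score
from sklearn.utils import resample

def oversample_minority(X, y, rate, seed=0):
    """Resample minority rows with replacement to `rate` x their original count."""
    minority = np.flatnonzero(y == 1)
    extra = resample(minority, replace=True,
                     n_samples=(rate - 1) * len(minority), random_state=seed)
    idx = np.concatenate([np.arange(len(y)), extra])
    return X[idx], y[idx]

for rate in (2, 5, 10):
    fold_aucs = []
    for tr, va in cv_folds.split(X_train, y_train):
        X_tr, y_tr = oversample_minority(X_train[tr], y_train[tr], rate)
        model = clone(pipe).fit(X_tr, y_tr)          # train on oversampled fold
        probs = model.predict_proba(X_train[va])[:, 1]
        fold_aucs.append(roc_auc_score(y_train[va], probs))  # untouched fold
    print(rate, np.mean(fold_aucs))
```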
After all these steps were completed, the best-performing pipeline (consisting of the best imputer, the best oversampling rate, and the best model hyperparameters) was retrained on the entire training dataset. This fully trained pipeline was then used to predict outcomes on the held-out test set. We report held-out test set results, including AUC, sensitivity, specificity, precision, and F1-score weighted for class imbalance (the F1 scores for the majority and minority classes were calculated separately, and we report their class-size-weighted average).
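A sketch of this final evaluation step follows, assuming `best_pipeline` is the winning combination from the steps above, `X_train_oversampled`/`y_train_oversampled` and `X_test`/`y_test` are the assumed training and held-out test arrays, and a 0.5 probability threshold (an assumption) converts risk scores to classes for the threshold-based metrics.

```python
from sklearn.metrics import (f1_score, precision_score, recall_score,
                             roc_auc_score)

best_pipeline.fit(X_train_oversampled, y_train_oversampled)  # retrain on all training data
probs = best_pipeline.predict_proba(X_test)[:, 1]
preds = (probs >= 0.5).astype(int)

print("AUC        ", roc_auc_score(y_test, probs))
print("Sensitivity", recall_score(y_test, preds))                  # positive-class recall
print("Specificity", recall_score(y_test, preds, pos_label=0))     # negative-class recall
print("Precision  ", precision_score(y_test, preds))
print("Weighted F1", f1_score(y_test, preds, average="weighted"))  # class-size-weighted
```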
Once this process was completed for the entire cohort, we added an extra round of hyperparameter tuning on several sub-cohorts: PCE-eligible patients, all patients aged 40–79 (including those with missing or out-of-range PCE variables), Hispanic patients, Asian patients, and Non-Hispanic White (NHW) and African American (AA) patients (considered together). This hyperparameter tuning was again done using the predefined 5 cross-validation folds (which were not equally sized for some of these sub-cohorts), and the final metrics are reported on the corresponding patients in the held-out test set. For patient populations that had no missing data (e.g., PCE-eligible patients), imputation was not needed, and the Boolean missing flag variables were not created.
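A sketch of sub-cohort tuning with the predefined folds is below; `fold_id` (each training patient's fixed fold assignment, 0–4) and the `asian_train_mask`/`asian_test_mask` arrays are assumed names, and restricting fixed folds to a sub-cohort is exactly what makes the folds unequally sized.

```python
from sklearn.model_selection import GridSearchCV, PredefinedSplit

# Reuse the fixed fold assignments, restricted to the sub-cohort.
sub_cv = PredefinedSplit(test_fold=fold_id[asian_train_mask])

search = GridSearchCV(pipe, param_grid={"model__max_depth": [2, 3]},
                      scoring="roc_auc", cv=sub_cv)
search.fit(X_train[asian_train_mask], y_train[asian_train_mask])

# Final metrics are computed on the matching patients in the held-out test set.
probs = search.best_estimator_.predict_proba(X_test[asian_test_mask])[:, 1]
```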
As a comparison to the PCE, which was primarily validated in NHW and AA patients, we also used the models trained only on NHW and AA patients to predict outcomes for Hispanic and Asian patients. Results are shown in Supplementary Table 6.
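The cross-population comparison behind Supplementary Table 6 amounts to the short sketch below: the pipeline tuned on NHW and AA patients (`nhw_aa_pipeline`, an assumed name, as are the test-set masks) scores Hispanic and Asian patients in the held-out test set.

```python
from sklearn.metrics import roc_auc_score

for group, mask in [("Hispanic", hispanic_test_mask), ("Asian", asian_test_mask)]:
    probs = nhw_aa_pipeline.predict_proba(X_test[mask])[:, 1]
    print(group, roc_auc_score(y_test[mask], probs))
```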