Machine-learning-based prediction models for high-need high-cost patients using nationwide clinical and claims data

High-need, high-cost (HNHC) patients, typically defined as those who account for the top 5% of annual healthcare costs, are responsible for as much as half of total healthcare spending. Accurately predicting future HNHC patients and designing targeted interventions for them has the potential to effectively control rapidly growing healthcare expenditures. To achieve this goal, we used a nationally representative random sample of the working-age population who underwent a screening program in Japan in 2013–2016, and developed five machine-learning-based prediction models for HNHC patients in the subsequent year. Predictors included demographics, blood pressure, laboratory tests (e.g., HbA1c, LDL-C, and AST), survey responses (e.g., smoking status, medications, and past medical history), and annual healthcare cost in the prior year. Our prediction models for HNHC patients combining clinical data from the national screening program with claims data showed a c-statistic of 0.84 (95% CI, 0.83–0.86) and outperformed traditional prediction models relying only on claims data.


INTRODUCTION
Rapidly growing healthcare spending has become one of the most significant challenges in many developed countries 1 . Existing evidence indicates that healthcare spending is concentrated among a small number of costly patients, known as high-need, high-cost (HNHC) patients, typically defined as those who account for the top 5% of annual healthcare costs. Research has shown that the top-1% and top-5% high-cost patients accounted for 23% and 50%, respectively, of all healthcare costs 2 . Prediction models for future HNHC patients have attracted the attention of policymakers and payers in recent years due to an expectation that interventions targeting this population may be more effective in reducing healthcare spending than interventions targeting the entire population [3][4][5][6] . Not only the Japanese government but also the Organization for Economic Co-operation and Development (OECD) considers HNHC patients one of the policy priorities with the potential to effectively curb rapidly growing healthcare costs 7 . Therefore, a valid, reliable, and implementable approach to accurately predicting HNHC patients in real time is critically important for designing targeted interventions that can effectively lower healthcare spending.
Although studies have sought to develop accurate models for predicting HNHC patients, their performance remains suboptimal due to the complex interplay among predictors and the lack of detailed clinical information (e.g., body mass index [BMI], blood pressure level, laboratory data) in the data used to construct the prediction models. Many of the existing studies on prediction models for high-cost patients relied on claims data, self-reported data, or electronic health records that do not include laboratory test results [8][9][10][11][12][13][14][15][16][17] . Evidence is limited as to whether data from laboratory tests, which arguably provide more granular and detailed clinical information, improve the performance of the prediction model. A recent study using data from the health screening program and claims data in South Korea reported an improvement in the performance of the machine-learning-based prediction model for the top 10% of high-cost patients 18 . However, although many insurers and providers globally are actively seeking an approach that can accurately predict HNHC patients [3][4][5][6]13,19,20 , it remains largely unclear whether a machine-learning-based prediction model using the detailed clinical information collected through a health screening program combined with claims data could achieve high prognostic performance for predicting HNHC patients in subsequent years 18 .
In this context, we developed and evaluated machine-learning-based prediction models for HNHC patients using data from national screening programs and claims in Japan. A model built on such routinely collected administrative data would be immensely helpful for policymakers and payers in identifying effective strategies to contain rapidly growing healthcare costs.

RESULTS
Beneficiary characteristics
During the study period, the database included 363,165 adults who underwent the national screening programs every fiscal year (from April 1 through March 31) in 2013-2016. Of those, we used a 10% random sample (n = 36,316) for the analyses (Table 1). Among the 36,316 individuals in our analytic cohort, 21,985 (61%) were male, and the median age in 2013 was 43 years. In 2016, the median annual healthcare cost was 43,270 JPY (376 USD; using an exchange rate of 115 yen per US dollar as of December 2016), and the top 1%, 5%, and 10% of patients accounted for 26%, 48%, and 60%, respectively, of all annual healthcare costs (Fig. 1).
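The cost concentration reported above (the top 1%, 5%, and 10% of patients accounting for 26%, 48%, and 60% of spending) follows directly from a ranked list of annual costs. A minimal sketch, using a hypothetical toy cost list rather than the study data:

```python
# Illustrative sketch: share of total spending attributable to the
# top-k% most expensive individuals (toy data, not the study cohort).
def top_share(costs, pct):
    """Fraction of total cost accounted for by the top `pct` percent of spenders."""
    ranked = sorted(costs, reverse=True)
    k = max(1, int(round(len(ranked) * pct / 100)))
    return sum(ranked[:k]) / sum(ranked)

costs = [5000, 200, 150, 90, 80, 60, 40, 30, 20, 10]  # hypothetical annual costs
print(round(top_share(costs, 10), 3))  # prints 0.88: one person drives 88% of spending
```

In a skewed distribution like this toy example, a single top spender can dominate total cost, which mirrors the concentration pattern the study reports.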

Prediction of HNHC patients
The discrimination ability of the different models, as represented by ROC curves, is shown in Fig. 2. The logistic regression model (the reference model) had the lowest discriminative ability (c-statistic, 0.82; 95% CI, 0.81-0.84), while all other machine-learning-based models had higher discriminative ability (Table 2). The reference model had the best negative likelihood ratio of 0.34 (95% CI, 0.31-0.36). In the decision curve analysis (Fig. 2), compared with the reference and the Lasso regression models, the net benefit of the other machine-learning-based models (e.g., the random forest model) was higher over the range of threshold probabilities, with the random forest and gradient-boosted decision tree models having the greatest net benefit. Given that the number of HNHC patients in our cohort (1815 adults) was substantially larger than the number of parameters used for the prediction (25 parameters), we assumed that the risk of overfitting was low. Indeed, the events per variable (EPV) for our primary prediction model was 72, indicating a low risk of overfitting (EPV < 20 is indicative of potential overfitting) 21 . We found no predictors with high variance inflation factors (VIF) (>10) among the parameters included in our reference model, indicating that collinearity is not an issue for our prediction models (Supplementary Table 1) 22 .

(Table 1 legend) Values represent n (%), unless otherwise indicated. IQR interquartile range, HbA1c hemoglobin A1c, TG triglycerides, LDL-C low-density lipoprotein cholesterol, HDL-C high-density lipoprotein cholesterol, AST aspartate aminotransferase, ALT alanine aminotransferase, γ-GTP gamma-glutamyl transpeptidase, ECG electrocardiogram.
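The negative likelihood ratio and EPV figures quoted above follow from simple definitions, restated in the sketch below. The sensitivity and specificity arguments are placeholders for illustration, not the paper's estimates:

```python
# Hedged sketch of two quantities reported above (placeholder inputs).
def neg_likelihood_ratio(sensitivity, specificity):
    # LR- = (1 - sensitivity) / specificity; lower is better at ruling out
    return (1 - sensitivity) / specificity

def events_per_variable(n_events, n_parameters):
    # EPV: outcome events per predictor; EPV < 20 flags overfitting risk
    return n_events / n_parameters

print(round(events_per_variable(1815, 25), 1))  # prints 72.6, reported as an EPV of 72 in the text
```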
I. Osawa et al.

Variable importance
The variable importance in the random forest and gradient-boosted decision tree models (model 1) is shown in Fig. 3. In both models, healthcare cost in the previous year was the most important predictor of HNHC patients. In the random forest model, obesity-related metrics (e.g., body weight and waist circumference), gender, and blood pressure were important in addition to healthcare cost. In the gradient-boosted decision tree model, age, the use of anti-hypertensive drugs, and predictors related to blood sugar level (e.g., HbA1c, fasting blood sugar, and hypoglycemic drugs) were important in addition to healthcare cost.

Sensitivity analyses
We found a similarly high performance of the machine-learning-based prediction models when we used different thresholds for defining HNHC patients (Table 3). The Lasso regression model had the highest prediction performance (c-statistic, 0.86; 95% CI, 0.84-0.88) for predicting those who account for the top 1% of healthcare costs, and the gradient-boosted decision tree model had the highest prediction performance (c-statistic, 0.88; 95% CI, 0.87-0.88) for predicting those who account for the top 10% of healthcare costs. We found no qualitative differences in the discriminative ability between the prediction models using single-year data and those using consecutive 2-year data (Tables 4 and 5). Annual healthcare cost in the previous year was the most important predictor of HNHC patients in the subsequent year (Fig. 3). The machine-learning-based prediction models using both clinical data from the screening program and healthcare cost calculated from claims data marginally improved the prediction performance compared to the models using only patient age, gender, and healthcare cost calculated from claims data (Tables 6 and 7). We found that adding 21 major diagnosis categories and 22 major procedure categories to the set of predictors included in our model did not significantly improve the prediction ability compared with the reference model (Table 2).

(Fig. 2B legend) Decision curve analysis. The X-axis indicates the threshold probability for HNHC patients; the Y-axis indicates the net benefit. The curves (decision curves) indicate the net benefit of the models (the reference model and four machine-learning-based models) as well as two clinical alternatives (classifying no people as HNHC patients vs. classifying all people as HNHC patients) over a specified range of threshold probabilities. Compared to the reference model, the net benefit, defined as "net benefit = (1 − false negative rate) × prevalence − false positive rate × (1 − prevalence) × the odds at the threshold probability", was greater for all machine-learning-based models across the range of threshold probabilities. HNHC patients = high-need, high-cost patients.

DISCUSSION
Using nationally representative data of individuals who underwent the national screening program in Japan, we found that HNHC patients accounted for almost half of all annual healthcare costs, similar to findings from other developed countries 2,4,13 . Machine-learning-based prediction models using both clinical data and healthcare cost calculated from claims data exhibited good prognostic performance for predicting HNHC patients in the subsequent year, compared with prediction models relying only on claims data. The prediction models using consecutive 2-year data did not significantly improve the prediction performance compared to the models using single-year data. Taken together, these findings highlight the importance of incorporating clinical data, such as laboratory test results, in developing machine-learning models that achieve high performance in predicting HNHC patients. We found that adding clinical data from the screening program to the data on healthcare cost from claims marginally improved the performance of the prediction models. On the other hand, we found no meaningful improvement in prediction ability from adding extra information on diagnoses and procedures to our prediction models, a finding consistent with prior studies 12,18 . This is probably because healthcare cost is a function of the individual billing codes (e.g., the diagnoses, procedures, and medications billed during hospitalizations or outpatient visits) available in the claims data; therefore, once the healthcare cost in the preceding year is included as a predictor, the additional benefit of adding a broad set of variables from the claims may be negligible.
On the contrary, clinical data from the national screening program used in our study may provide complementary information about the participants' health status that are not available in the claims data (e.g., HbA1c level), and thus, the inclusion of both clinical data from the screening program and healthcare cost data from claims led to a better performance of the prediction model compared to the models that used only one of these two databases.
Multiple studies to date have sought to identify future HNHC patients using conventional approaches such as logistic regression models 9,11,12,14 . For example, a study that used logistic regression to predict the top 25% highest-cost employees among the United Auto Workers reported c-statistics of 0.78 using claims data and 0.73 using self-reported health data 14 . While these conventional approaches had moderate prognostic performance, advanced machine-learning-based approaches have the potential to improve their prediction ability 8,12,15,18 and possess scalability (e.g., extracting important features from a prediction model using many variables without physicians' interpretations) 23 . Tamang et al. developed a model using elastic-net penalized logistic regression to predict the top 10% of HNHC patients using claims data in Denmark and reported a best c-statistic of 0.84 12 . The prediction ability of our model for HNHC patients was better than that of similar machine-learning-based prediction models reported in previous studies 12,16 . The difference is likely due to the inclusion of detailed clinical data collected through the screening program. Several studies have developed machine-learning-based prediction models for HNHC patients using administrative claims data [8][9][10][11][12][13][14][15][16][17] ; however, few have incorporated laboratory data into claims data. In 2019, Kim and colleagues analyzed data from South Korea and found a marginal improvement in prediction ability for the top 10% of HNHC patients by adding clinical data from the screening program to claims data 18 .

Table 2. Prediction ability of the reference and four machine-learning-based prediction models for HNHC patients.
Our findings using Japan's data were consistent with what they found using South Korean data, which supports the robustness and generalizability of our findings.
Our study has limitations. First, our data mainly consisted of the working-age population aged 18-75 years who underwent the national screening program. Therefore, our findings may not be generalizable to other age groups such as children and the elderly. This is particularly important because elderly people often account for a large proportion of healthcare expenditures in developed countries 24,25 . Second, our data from the national screening program included missing data (0.1%-36% of data in continuous variables), which could be a potential source of bias. However, we believe this issue was minimized in our analyses through the use of random forest imputation for missing continuous variables, a rigorous technique for the imputation of missing data 26 . Lastly, given that we used nationwide data from Japan, our findings may not generalize to the prediction of HNHC patients in other countries. However, our findings were consistent with a recent study conducted in South Korea 18 , which suggests potentially high generalizability of our findings to other contexts.
In summary, using nationally representative data from Japan, machine-learning-based models for predicting HNHC patients using clinical data from the national screening program and claims yielded good prediction performance. In Japan, the time between the submission of claims by healthcare providers to insurers and the availability of the data for analysis is approximately two months. Therefore, our prediction models have the potential to help policymakers and insurers accurately identify future HNHC patients in near real time and intervene if necessary, with the aim of curbing rapidly growing healthcare costs due to the aging population.

METHODS
Data source and study population
We analyzed data from the nationwide claims database (MinaCare database) from April 1, 2013, to March 31, 2016. The MinaCare database collects claims from large employers and currently covers ~7.3% of the Japanese working population. This database includes working individuals and their dependent family members, with a wide range of age groups 27 . From the nationwide claims database, we used a 10% random sample of the 363,165 adults aged 18 years and older who underwent the national screening programs every year from 2013 through 2016 (~45% of participants who received the screening in 2013 were included in our final sample).
In Japan, all adults are required by law to undergo the national screening program at least once a year, according to the Industrial Safety and Health Act enacted in 1972 28 . The screening program is standardized nationally and includes several examinations, tests, and surveys, covering demographics (height, body weight, waist circumference), eyesight, hearing, chest X-rays, blood pressure, laboratory tests (blood tests and urinalysis), electrocardiograms, past medical history, occupational history, and subjective and objective symptoms (Supplementary Table 3). From the claims data, we included annual healthcare cost in the prior year, which includes all healthcare costs (except for a very small proportion of healthcare services not covered by health insurance [e.g., costs for cosmetic surgeries and over-the-counter drugs]) and has been shown to be one of the strongest predictors of future healthcare costs 7,15,18 . We decided to use only the data on healthcare cost in prior years in the development of our primary prediction models because prior studies found no meaningful improvement in prediction ability from adding a large number of variables available in the claims data (e.g., the diagnoses, procedures, and medications billed during hospitalizations or outpatient visits) to the data on healthcare cost 12,18 .

Outcomes
The outcome was becoming an HNHC patient in the subsequent year. We defined HNHC patients as those who account for the top 5% of annual healthcare costs, an approach used in prior studies 4,6 .
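Under this definition, the outcome label is a simple percentile cutoff on annual cost. A minimal sketch, with hypothetical patient IDs and costs:

```python
# Minimal sketch of the outcome definition: flag the top 5% of annual
# spenders as HNHC (IDs and costs below are hypothetical).
def label_hnhc(costs_by_id, top_pct=5):
    n_top = max(1, int(round(len(costs_by_id) * top_pct / 100)))
    ranked = sorted(costs_by_id, key=costs_by_id.get, reverse=True)
    top = set(ranked[:n_top])
    return {pid: (pid in top) for pid in costs_by_id}

costs = {f"p{i}": cost for i, cost in enumerate([900, 80, 70, 60, 50] + [10] * 15)}
labels = label_hnhc(costs)      # 20 people, so the top 5% is one person
print(sum(labels.values()))     # prints 1
```

In the study this label is computed on the subsequent year's costs, so the model predicts next-year HNHC status from prior-year predictors.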

Statistical analysis
Machine-learning-based models. We developed five machine-learning-based models to predict HNHC patients in the subsequent year: (1) logistic regression (used as the reference model), (2) logistic regression with Lasso regularization (Lasso regression) 29 , (3) random forest 30 , (4) gradient-boosted decision tree 31 , and (5) deep neural network 32 . Lasso regularization extends the standard regression model with a regularization parameter (lambda) that shrinks large coefficients toward zero to minimize potential overfitting; we fit this model using the glmnet package 29,33 . Random forest is an ensemble of decision trees created by bootstrap aggregation and random feature selection 34 . Gradient-boosted decision tree is an additive model of decision trees estimated by gradient descent 31,35 .

Table 3. Prediction ability of the reference and four machine-learning-based prediction models for top 1% or 10% healthcare cost spenders. We compared the area under the curve between each machine-learning-based prediction model and the logistic regression model (the reference model) using DeLong's test.

Table 4. Prediction ability of the reference and four machine-learning models using consecutive 2-year data for HNHC patients. We used a non-penalized logistic regression model as the reference model. We compared the area under the curve between each machine-learning-based prediction model and the logistic regression model (the reference model) using DeLong's test.
We used a grid search strategy to identify the best tuning hyperparameters, using the ranger and caret packages for the random forest and gradient-boosted decision tree models 30,36 . Deep neural network is a machine-learning algorithm that uses multiple layers to model the nonlinear relationship between predictors and outcome 37 . We constructed a multiple-layer, feedforward model with the adaptive moment estimation (Adam) optimizer 38 using the keras package for R (version 3.6.1) 32 and developed the final models by manual tuning of the hyperparameters (i.e., the number of layers, hidden units, learning rate, learning rate decay, dropout rate, batch size, and epochs).
Model development, validation, and assessment. We first developed prediction models using predictors in the 2014 data and the outcome in the 2015 data (i.e., HNHC patients in 2015). Next, we validated these prediction models using predictors in the 2015 data and the outcome in the 2016 data. All predictors we used and the numbers of missing and non-responded data are shown in Supplementary Table 4. We conducted multiple imputation for missing data in continuous variables using the random forest method 39 , and used the following variables for the imputation: patient demographics, blood pressure levels, laboratory data, and survey responses. Random forest imputation is a nonparametric algorithm that can accommodate nonlinearities and interactions and does not require a particular parametric model to be specified 26 . The single point estimates were generated by random draws from independent normal distributions centered on conditional means predicted using random forest. Random forest uses bootstrap aggregation of multiple regression trees to reduce the risk of overfitting, and it combines the estimates from many trees 39 . In contrast, all non-responses in the survey data (i.e., questionnaires on past medical history and social history) and ECG abnormalities were assumed to be normal based on clinical reasoning. In model development and validation, we used several techniques to minimize potential overfitting: (1) Lasso regularization, (2) cross-validation (Lasso regularization, random forest, and gradient-boosted decision tree), (3) out-of-bag estimation (random forest and gradient-boosted decision tree), (4) dropout and batch normalization (deep neural network), and (5) validation of each model using data from different years. To address the potential collinearity of parameters included in our prediction models, we calculated the variance inflation factor (VIF) 22 .
The prediction performance of each model was assessed by computing (1) c-statistics (i.e., the area under the receiver-operating-characteristic [ROC] curve), (2) prospective prediction results (i.e., sensitivity, specificity, positive predictive value, negative predictive value, positive likelihood ratio, and negative likelihood ratio), and (3) decision curve analysis. To address the class imbalance in the outcome (i.e., the low proportion of individuals classified as HNHC), we chose the threshold for the prospective prediction results based on the ROC curve (i.e., the Youden index) 40 . The decision curve analysis is a measure that takes into account the different weights of different misclassification types, with a direct clinical interpretation (e.g., trade-offs between under- and over-estimation for each model) 41,42 . Specifically, the relative impact of false-negative (under-estimation) and false-positive (over-estimation) results given a threshold probability (or clinical preference) was accounted for to yield a "net benefit" for each model. The net benefit of each model over a specified range of threshold probabilities was defined as Eq. (1) and graphically displayed as a decision curve 41,42 .
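The Youden-index threshold choice described above maximizes sensitivity + specificity − 1 across ROC operating points. A minimal sketch with invented ROC points (not the study's):

```python
# Sketch: choosing a classification threshold via the Youden index,
# i.e., the ROC point maximizing TPR - FPR (invented points below).
def youden_threshold(roc_points):
    """roc_points: iterable of (threshold, tpr, fpr); returns the best threshold."""
    return max(roc_points, key=lambda p: p[1] - p[2])[0]

roc = [(0.1, 0.95, 0.60), (0.3, 0.85, 0.25), (0.5, 0.60, 0.10)]
print(youden_threshold(roc))  # prints 0.3: J = 0.85 - 0.25 = 0.60 is the largest
```

Because the index weighs sensitivity and specificity equally regardless of class prevalence, it is a common default for imbalanced outcomes such as the 5% HNHC rate here.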
net benefit = (1 − false negative rate) × prevalence − false positive rate × (1 − prevalence) × the odds at the threshold probability (1)

To gain insights into the contribution of each predictor to the machine-learning-based models, we also computed the variable importance in the random forest and the gradient-boosted decision tree models. The variable importance is scaled to have a maximum value of 100 36,43 . DeLong's test was used to compare ROC curves 44 .
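The variable-importance scaling described above (rescaling raw importances so the maximum equals 100) can be sketched as follows; the raw importance values are made up for illustration:

```python
# Sketch of scaling variable importance to a maximum of 100
# (raw importance values below are hypothetical).
def scale_importance(raw):
    m = max(raw.values())
    return {k: 100 * v / m for k, v in raw.items()}

raw = {"prior_cost": 0.42, "age": 0.21, "hba1c": 0.105}
print(scale_importance(raw))  # prior_cost -> 100.0, age -> 50.0, hba1c -> 25.0
```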

Sensitivity analyses
We conducted several sensitivity analyses. First, we used different thresholds for defining HNHC patients: those who account for the (1) top 1% and (2) top 10% of annual healthcare costs. Second, as prediction models using longitudinal data may have better prediction ability, we developed prediction models using consecutive 2-year data (i.e., data in 2013-2014) and the outcome in 2015. We then validated the models using predictors in 2014-2015 and the outcome in 2016. Third, to assess the benefit of including clinical data from the national screening programs as predictors, we compared three machine-learning-based prediction models: (1) the model using only clinical data collected through the screening program, (2) the model using only patient age, gender, and healthcare cost data from claims, and (3) the model using both clinical data from the screening program and healthcare cost calculated from claims data. Lastly, we developed prediction models that additionally included the data on diagnoses and procedures available in the claims data (21 major diagnosis categories and 22 major procedure categories) as predictors, to investigate whether adding detailed claims data to our primary models improves the prediction performance.
A P-value < 0.05 was considered statistically significant. All analyses were performed with R version 3.6.1 (The R Foundation for Statistical Computing). This study was a secondary data analysis of de-identified data (fully anonymized prior to receipt), and therefore it was exempt from review by The University of California, Los Angeles Institutional Review Board, and participant consent was not required.

Reporting summary
Further information on experimental design is available in the Nature Research Reporting Summary linked to this paper.

DATA AVAILABILITY
The MinaCare data are the property of MinaCare, Co., Ltd. and are not publicly available for research purposes. Researchers who would like to access the data for research purposes should contact Dr. Yuji Yamamoto (mc_info@minacare.co.jp) to establish a data use agreement and pay a fee to access the data.

CODE AVAILABILITY
Statistical codes and machine-learning algorithms will be made available upon request submitted to the author.
Table 6. Prediction ability of the reference and four machine-learning-based prediction models for HNHC patients using part of clinical and claims data. We used a non-penalized logistic regression model as the reference model. We compared the area under the curve between each machine-learning-based prediction model and the logistic regression model (the reference model) using DeLong's test.

Table 7. Contribution of the predictors to the prediction ability. We compared the area under the curve between each machine-learning-based prediction model and the logistic regression model (the reference model) using DeLong's test.