A risk stratification tool for hospitalisation in Australia using primary care data

Predictive risk models using general practice (GP) data to predict the risk of hospitalisation have the potential to identify patients for targeted care. Effective use can help deliver significant reductions in the incidence of hospitalisation, particularly for patients with chronic conditions, the highest consumers of hospital resources. There are currently no published validated risk models for the Australian context using GP data to predict hospitalisation. In addition, published models for other contexts typically rely on a patient’s history of prior hospitalisations, a field not commonly available in GP information systems, as a predictor. We present a predictive risk model developed for use by GPs to assist in targeting coordinated healthcare to patients most in need. The algorithm was developed and validated using a retrospective primary care cohort, linked to records of hospitalisation in Victoria, Australia, to predict the risk of hospitalisation within one year. Predictors employed include demographics, prescription history, pathology results and disease diagnoses. Prior hospitalisation information was not employed as a predictor. Our model shows good performance and has been implemented within primary care practices participating in Health Care Homes, an Australian Government initiative being trialled for providing ongoing comprehensive care for patients with chronic and complex conditions.

Predictive risk models using general practice (GP) data to predict the risk of hospitalisation have the potential to identify patients for targeted care. Effective use can help deliver significant reductions in the incidence of hospitalisation, particularly for patients with chronic conditions, the highest consumers of hospital resources. There are currently no published validated risk models for the Australian context using GP data to predict hospitalisation. In addition, published models for other contexts typically rely on a patient's history of prior hospitalisations, a field not commonly available in GP information systems, as a predictor. We present a predictive risk model developed for use by GPs to assist in targeting coordinated healthcare to patients most in need. The algorithm was developed and validated using a retrospective primary care cohort, linked to records of hospitalisation in Victoria, Australia, to predict the risk of hospitalisation within one year. Predictors employed include demographics, prescription history, pathology results and disease diagnoses. Prior hospitalisation information was not employed as a predictor. Our model shows good performance and has been implemented within primary care practices participating in Health Care Homes, an Australian Government initiative being trialled for providing ongoing comprehensive care for patients with chronic and complex conditions.
The growing burden of chronic conditions (also known as non-communicable diseases), is now responsible for 70% of deaths globally 1 . Chronic conditions affect one in two people in developed countries and the World Health Organisation attributes three quarters of global chronic condition-related deaths to developing countries 1 . There is a strong imperative to address this disproportionate burden of chronic conditions on health care globally.
Worldwide efforts to move from episodic to integrated and coordinated care have delivered significant improvements in the management of chronic conditions in the primary care sector, demonstrating benefits to patients and the healthcare system [2][3][4][5][6][7] . To ensure optimal use of limited healthcare resources, it is crucial that such programs and resources target patients who would otherwise be admitted for an unplanned or avoidable hospitalisation.
The best performing risk stratification algorithms for primary care settings have generally been developed in locations where large linked datasets are practical and available. These datasets can offer good coverage of data variables and include information from outside the primary care setting, such as the patient's history of previous hospitalisations, a highly significant predictor in such models 8,9 . However, such linked data sets are not available in most health systems throughout the world. For example, in Australia there are no simple mechanisms to link patient records across primary and acute care settings. Furthermore, patient information about previous hospitalisations is not available in most primary care information systems. In such cases, there is a need for new algorithms based on data from the setting in which they will be used 8 .
Health Care Homes is a new Australian Government initiative with ongoing comprehensive care provided for up to 65,000 patients with chronic and complex conditions in up to 200 primary care clinics and Aboriginal Community Controlled Health Services (their 'Health Care Home'). Patients eligible for the 22-month trial are identified in the general practice (GP) using an algorithm that predicts the risk of a patient being hospitalised over the next 12 months. By targeting services to patients with an imminent risk of hospitalisation, a significant Participants. The Victorian primary care patient database contained data for over 1.8 million patients. This was filtered to consider only patients who attended their primary care clinics at least once between 2011 and 2016, resulting in approximately 1.3 million patients. To ensure every patient in the resulting cohort had at least one year of history ("look back") available, and that every patient visited their GP after the one-year prediction period, and was alive for the entire prediction period, two key dates were calculated for each patient. "One year ahead" is one year from the date of a patient's first visit (minimum value 1 Jan 2012). "One year back" is one year before the date of a patient's last visit (maximum value 31 December 2015). The patient cohort was subsequently filtered to include only patients where the "One year back" date occurred on or after the "One year ahead" date.
The cohort was further filtered to exclude patients below the age of 18 and above the age of 106 years, patients without Victorian postcodes, patients with gender other than "Male" or "Female", and physiological observations and pathology records with inconsistent or incorrect units and values. Figure 1 presents the cohort population at each stage of the cohort selection process. The resulting cohort, hereafter referred to as the Primary Cohort, comprised 393,229 patients, of whom 7.2% had hospitalisations of interest. A chronic-only subset of the Primary Cohort was also created for sub-analysis and is described in following sections.
To avoid a bias arising from consistently choosing the earliest or latest possible date as a patient's "Prediction Date", a random date was chosen between the "One Year Ahead" and "One Year Back" dates. While ensuring that sufficient history and prediction windows were available, this suited the perceived consumer use of the model, where GP practices may want to predict on their cohort at an "ad hoc" time period, not linked to a particular prediction date. Figure 2 presents the methodology chosen for selecting a random prediction date for each patient. The public hospitalisation extract revealed several patients that appeared to be different people in the primary care dataset but were determined to be the same person. Where this was identified, their primary care records were merged to improve data quality.

Emergency or Potentially Preventable Hospitalisations.
The outcome variable of interest for study was whether a patient has an Emergency or Potentially Preventable Hospitalisation within one year of the prediction date. Potentially Preventable Hospitalisations were included as they are a defined indicator for ambulatory care sensitive conditions 10 and the case for targeting these through general practice integrated care interventions is well justified 11,12 . An Admission Type flag was employed to identify Emergency Hospitalisation. International Classification of Diseases (ICD) and Procedure Code information was employed to identify Potentially Preventable Hospitalisations using the guideline provided by the Australian Institute for Health and Welfare 13 . A binary (1/0) "One Year Emergency or Potentially Preventable Hospitalisation" variable was created after calculating, for each patient, the days from their prediction date to their next Emergency or Potentially Preventable Hospitalisation, whichever came first. The data comprised an extract from the inpatient (VAED) and emergency department (VEMD) datasets linked to the death registry (Registry of Births, Deaths and Marriages, Victoria). The only records of interest were related to inpatient hospitalisation.

Predictors.
The list of all predictors used to develop the model is shown in Table 1 with further description below.
Demographics. Age at cohort entry was expected to be a key predictor of risk and was implemented as a continuous variable instead of categorical to prevent loss of information within categories. To capture a non-linear relationship of risk on age, the square root of age (√age), age 2 (age × age) and age 3 (age × age × age) predictors were also created. This was preferred over Cubic Splines, given that these were more practical to implement within the production environment of software systems used in primary care practices participating in the Health Care Homes trial.
For the Australian context, ethnicity was defined as Indigenous Australians, non-Indigenous Australians, or Unknown (where no ethnicity data was available).
Postcode was used to match with the Australia Bureau of Statistics' SEIFA Index of Relative Socio-economic Advantage and Disadvantage (IRSAD) 14 , and with the Australian Department of Health's Modified Monash Model (MMM) 15 . The SEIFA index summarises information about the economic and social conditions of people and households within an area, including both relative advantage and disadvantage measures, and was employed as a categorical decile value (i.e. grouped into 10 bands with lower values signifying higher levels of disadvantage, plus "Unknown" for missing postcodes). MMM is a recently developed geographical classification system designed to better address the maldistribution of medical services across Australia, taking values ranging from 1 to 7, with increasing values representing higher levels of rurality, plus "Unknown" for missing postcodes.
Patient Observations and History. Recorded physiological observations were included following data preparation to remove non-numeric characters, conversion of units to standardised forms, and removal of values and units identified as being "incorrectly recorded". Since all physiological observations are not repeated at each GP visit, the system looked back through all available physiological observations (going back to 2007) to retrieve the most recent available reading.
Smoking status was included as a four-category predictor. Four variations of alcohol consumption were included. In the Australian context, the main software systems collect the number of alcoholic drinks per day consumed by the patient on average, the number of days per week the patient consumed alcohol on average, both of these, or a related measure. To increase the usability of the model in the Australian context, we defined alcohol consumption as a binary condition indicated by any consumption of alcohol per day and alcohol days per week data. We note that "volume" information is lost when converting to a binary category which is commented on in the Discussion. An additional category was included to indicate no alcohol consumption data was available. Figure 2. Illustration of Prediction Date as a random day between the "one year ahead" and "one year back" dates. Note: The "look back" and "look ahead" periods ensured that sufficient history and prediction window is available. A random date helped avoid bias while also suiting the perceived consumer use of the model.

Continued
Height, weight and body mass index (BMI) were also employed for prediction. BMI was calculated where it was not specifically recorded, but where height and weight were available. Height and weight were used as continuous predictors while BMI was employed as a categorical predictor, either with four categories or with six categories. In the 4-category predictor, the categories corresponded to: underweight/normal, overweight, obese, and "not recorded". In the 6-category predictor, the obese category was subdivided into moderately obese, severely obese, and very severely obese.
Medication Variables. The usage of certain medications as indicated in the GP prescription data was included in the modelling using 0/1 binary indicator variables for each of six medication groups (statins, anticoagulants, antidepressants, antipsychotics, anti-inflammatories, and steroids). Groupings of medications into each medication group was completed in consultation with clinical and pharmacy experts from the Australian Government Department of Health. Data was searched back to 2007 (see Supplementary Table S1). The number of prescribed medication groups was also included in modelling either as the total number of prescribed groups (range 0-6) or as a categorical variable with five categories.
Diagnosis Variables. Grouping of diagnosis conditions/diseases of interest into "diagnosis families" was completed in consultation with the Australian Government Department of Health. In total, 35 diagnosis families were identified as relevant for the model (e.g., type 1 and 2 diabetes, osteoporosis, rheumatoid arthritis, cancer, learning difficulties, falls etc.). Data was searched back to 2007. Due to sparsity in the hospitalisation data, these "diagnosis families" were then further combined into 16 "disease groups". The grouping was logical and clinically relevant (e.g., hyperlipidaemia, hypercholesterolaemia, and hypertriglyceridaemia were combined into a "bloodfats" category), and was informed by the incidence of each of the diagnosis conditions to ensure counts of the employed categorical variable were sufficient for analysis. A 0/1 binary indicator predictor for each group indicated the presence of at least one of the constituent diagnoses of interest in the GP records. The mapping of diagnosed conditions into diagnosis families and diagnosis groups is presented in Supplementary Table S2. These diagnosis group variables were the primary means to include the occurrence of relevant diagnosed conditions/ diseases in the models. Note that these predictors correspond to diagnosis, not occurrence; they provide no way to capture additional risk from undiagnosed conditions/diseases of interest. However, the morbidity risk groups from pathology test results do provide a limited way to do this. Comorbidity associated with each patient was also included in modelling as the total number of distinct diagnosis families, either as a continuous predictor (range 0-35) or as a categorical variable with nine categories.
A "Chronic-only" group of diagnosis families was also defined, being one of 33 chronic conditions (i.e. any of the 35, excluding learning difficulties and falls). A "Chronic-only subset" of the patient cohort was then defined as including only those patients who had at least one diagnosis belonging to one of the "Chronic-only" group of diagnosis families.
Pathology Variables. Twelve pathology results were included as predictors. Following removal of values and units identified as being "incorrectly recorded", the most current value of the available pathology results was used. Normal ranges for pathology results were defined in consultation with the Australian Government Department of Health and used to calculate morbidity risk flags representing three categories of risk -Low, Medium and High (see Supplementary Table S3). While not strictly a pathology test result, blood pressure was handled in www.nature.com/scientificreports www.nature.com/scientificreports/ a similar way. Due to data sparsity, medium and high levels were each combined for Bilirubin, Creatinine and Triglycerides.
For a number of patients, no test results were available. If these were treated as missing values it would be very difficult to include those patients in models using test results as predictors. An alternative approach we employed was to add a category for "no test history" for each predictor which enabled inclusion of all patients. Note that the absence of a test result is not the same as a test result indicating low risk.

Methodology.
A predictive model for the binary event of Emergency or Potentially Preventable hospitalisation within one year was developed. Several machine learning approaches were chosen for model development and validation. Logistic regression 16 was chosen because it is an established method for prediction in binary problems and implementation of models in a range of production environments is straightforward using basic mathematical functions. For comparison, four alternative types of models were also considered (Naïve Bayes 17 , Artificial Neural Networks 18 , Random Forests 19 and Generalised Boosting 20 ). Random Forests, Artificial Neural Networks and Generalised Boosting algorithms were considered because of their established superiority in pattern recognition from large complex data. Naive Bayes was employed as it offers a different approach to model building. Multiple variants of the full model were explored. For example, BMI with four or six categories, alcohol consumption as drinks per day or non-drinker/drinker, and the number of diseases as a numeric or a categorical variable were each considered. For each model variant, the performance of the logistic regression model and the four machine learning approaches were assessed.
Models were compared and validated using the area under the receiver operating characteristic curve (AUC) with 10-fold cross-validation. Optimism was estimated using bootstrap validatation 16 with 100 bootstrap samples. For additional results (e.g., receiver operating characteristic (ROC) curves, calibration curves), patients were randomly split into 70%/30% training and test portions.
The subgroup of patients with at least one chronic condition (Chronic-only subset) is an important subgroup of primary care patients, especially since many programs like Health Care Homes tend to focus on patients with established chronic conditions. Although models were developed using the entire cohort of patients, additional results for this subset of patients were obtained and are reported in the Supplementary material (see Supplementary Results for the Chronic-Only Subset). This approach was used instead of restricting all model development and validation to the chronic-only subset to take advantage of the information and large patient numbers amongst those without a diagnosed chronic condition.
Software. All analyses were performed using the R statistical computing environment. Logistic regression models used the "glm" command. Random forest models employed the "randomforest" package. Generalised boosting models used the "GBM" package. Naïve Bayes models employed the "klaR" package. Artificial Neural Networks models employed the "nnet" package.
Code Availabilty. The code used for model development and validation is available from the corresponding author on reasonable request for non-commercial purposes.

Results
Participants. Table 2 summarises the patient characteristics in the Primary Cohort all together, as split by the outcome variable, and as split randomly into 70% training (development) and 30% testing (validation) datasets. For categorical quantities, values used as the reference category in modelling are listed first. Table 3 profiles hospitalisations for the Primary Cohort. The rate of emergency or potentially preventable hospitalisations within one year is 7.2%. Table 4 presents a comparison of AUC results using 10-fold cross-validation for the best performing model (the "final" model), a simpler canonical model (age and number of diagnosis families) and a variant on the final model capturing the size of alcohol consumption using the 6-category variable for alcohol per day. (Additional results for the Chronic-only subset are provided in the Supplementary material). Two modelling approaches, Logistic Regression and Generalised boosting are presented as no other method had higher AUC values. Results from 10-fold AUC for these other models were never better than for the final model. The bootstrap estimate of optimism of the final logistic regression model is 0.0017.

Model Performance.
As part of the model development process, deviance residuals for the logistic regression models were calculated. For seven variables (age, gender, ethnicity, 6-category BMI, smoking status, any alcohol consumption and SEIFA IRSAD), ANOVA was used to check whether there was a statistically significant relationship between the predictor and the deviance residuals, the presence of which would suggest the model was missing a key relationship. The checks were performed for the residuals of the entire training dataset and for just the Chronic-only subset of the training dataset. A highly significant relationship for gender was found in the Chronic-only subset (P = 0.002) but not in the entire training dataset (P = 0.168). Interaction terms between diagnosis group and gender were investigated by creating an interim model with all diagnosis group-gender interaction terms. Interaction terms in the model were retained for the final model if the P-value for the interaction term in the model output was below a generous threshold of 0.5. As a result, 9 interaction terms were included in the final model. Repeating the ANOVA checks for the final model, inclusion of these 9 terms largely removed the effect for gender (training dataset P = 0.043, Chronic-only P = 0.052). P-values for other variables were always larger than 0.03 for both datasets before and after adding the interaction terms. Figure 3 shows the receiver operating characteristic (ROC) curve for the final model. Labelled points indicate the corresponding location of risk for indicated quantiles. The diagonal reference line is the line of no discrimination and corresponds to the performance of random guesses. Figure 4 shows a calibration curve for risk groups defined www.nature.com/scientificreports www.nature.com/scientificreports/  Continued www.nature.com/scientificreports www.nature.com/scientificreports/ by deciles of predicted risk. For each group, the horizontal coordinate is the mean of predicted risk of hospitalisation and the vertical coordinate is the mean (with 95% confidence interval) for the observed proportion of hospitalisations. The dashed diagonal line indicates ideal performance where observed and predicted risk are equal.
Using a 70%/30% train/test data split, we checked the amount of bias in our unknown/not recorded categories by stratifying the test dataset into known and unknown parts for variables of interest (BMI, ethnicity, smoking, alcohol consumption, IRSAD advantage/disadvantage). Predictive performance on the 30% test dataset (AUC 0.665) is similar to that from 10-fold cross validation (AUC 0.663), so the results should be indicative. Calibration for the known parts are universally excellent. For the unknown parts, if there is bias it is small. Comparing mean predictions to observed proportions of hospitalisations, all results are within confidence intervals (BMI mean prediction 0.184, observed 0.190 (95% CI: 0.181-0.199; ethnicity mean prediction 0.212, observed 0.218 (95% CI: 0.202-0.234), smoking mean prediction 0.214, observed 0.233 (95% CI: 0.211-0.255); IRSAD unknown or AnyAlcohol not recorded mean prediction 0.232, observed 0.264 (95% CI: 0.228-0.300). For all of these results, the difference in mean with the neighbouring risk group decile greatly exceeds the difference between known/ recorded and unknown/not recorded within the decile.     www.nature.com/scientificreports www.nature.com/scientificreports/ Columns 3-4 shows the standard error and P-value for each coefficient. Column 5 shows the odds ratio with 95% confidence interval for each category compared to its reference level. Odds ratios are not provided for continuous predictors as their interpretation is made more complicated by quadratic and higher order terms.

Model Specification and Computation.
As with other logistic regression models, predictions are calculated via the linear predictor. The linear predictor, y, is defined using the equation where for each i, C i is the model coefficient and V i is the corresponding variable value (e.g., 0 or 1 for binary indicator and dummy coding variables). The value of the intercept is also shown in the table. Note that for this model,

Discussion
This study presents a prediction model designed for Australian primary care practices to identify patients with chronic conditions in their patient population that are at high risk of hospitalisation over the next 12 months. Despite not having access to a rich linked data repository and high impact predictors such as previous hospitalisation history, the prediction model performs at similar levels to other state-of-the-art prediction models and demonstrates the efficacy of using this approach in similar settings.
Our model shows good performance by the area under the Receiver Operating Characteristic curve (AUC~0.66) and is well calibrated. As might be expected, obesity and smoking increase the risk of hospitalisation, while higher economic advantage lowers the risk of hospitalisation. Interestingly, alcohol consumption is also associated with decreased risk of hospitalisation. However, in our cohort many more drinkers have trivial to moderate consumption, so we would expect the category to be dominated by these kinds of drinkers. QAdmissions 8 reported decreased risk of hospitalisations for this group, compared with increased risk for heavy drinkers, so our model appears consistent with other published results. We further refined our focus to the subset of patients with   www.nature.com/scientificreports www.nature.com/scientificreports/ at least one diagnosed chronic condition, for which targeted healthcare is more important. On this subset, our model showed even better performance. Smoothed plots of predicted vs observed probabilities suggest that our model is doing well to capture increasing risk by age and number of diseases, two variables known to be important for this kind of risk stratification.
In current literature, variables related to prior health care utilisation have been shown to have a strong influence on the performance of such models. A review by Wallace et al. 9 explored 18 risk prediction models that used routinely collected data for prediction. Seven of the 8 models in this group that were comparable to this study employed prior emergency admission in the final model, while the eighth created customised models that used a combination of data sources that included hospitalisation and community data. For example, the QAdmissions 8 algorithm predicts emergency admission within 2 years (AUC 0.77-0.78) but, like most other algorithms, employs prior emergency hospitalisation information (sourced from both GP data and linked hospital episode data) as the most significant predictor.
Heterogeneity in the definitions of study populations and outcome measures limits the ability to directly compare our model's performance with other work. Models such as the adapted PEONY 21 and HARP 22 , and the models presented by Johnson et al. 23 focus on older patients only and are therefore not directly comparable. In addition, while Johnson et al. present models based on "GP-like-data", data for their study is drawn from a self-reported, English-only, voluntary survey and simulates conditions where data is accessed through primary care practice management systems. It is therefore likely to be more complete than real-world primary care data. Also, they use survey variables like 'health condition count' and 'self-rated health' to approximate the patient health history information that is commonly captured in GP patient management software. Our experience of data extracted directly from GP patient management systems suggests that this is far from representative of the true picture. Haas et al. 24 evaluated the performance of six algorithms in predicting outcome measures including inpatient hospitalisation (AUC 0.67-073) and Emergency Department visits (AUC 0.58 to 0.67) within 12 months. While somewhat similar, both of these outcome measures are different enough from the one employed for this study to make direct comparison difficult. On the other hand, efforts such as the Gold Coast Integrated Care in Australia 25 employ a purposely designed Risk of Hospitalisation scoring mechanism for which no validation or performance information is published, making comparison impossible. In general, when comparing similar outcome measures, our predictive risk model performed at a level that was similar to, if not better than, other models that did not include previous hospitalisation as a predictor variable.
As with other studies of this type, there are limitations to this work. We suspect that model performance was constrained by primary care data quality issues, especially data correctness, inconsistencies, and completeness. Significant effort was devoted to data preparation to ensure quality issues (e.g., data entry issues, unit inconsistencies) were appropriately handled, derived predictors were calculated in a consistent manner, and steps were taken to reduce missingness where possible (e.g., calculating BMI where height and weight could be found in the data). Nevertheless, missing data was a feature in several key variables (BMI, ethnicity, alcohol consumption, smoking history, index of advantage/disadvantage). The expected use for our model is in GP clinics using computerised systems similar to those from which the sample data was obtained. Thus, we anticipate similar patterns of missingness in the production environment.
Statistical approaches to missing data include complete case analysis and multiple imputation. We avoided complete case analysis due to concerns about introducing bias. Multiple imputation is a common approach but requires an assumption that the data is "missing at random" and development of additional models to impute missing data, possibly using additional covariates. We deliberately used a different approach, creating an unknown or "not recorded" category for each of these categorical predictors. We believe that in many cases the missingness reflects the nature of the missing value and possibly the reason for the GP visit. In short, we believe the data is missing not at random and has predictive power of its own. An advantage of our approach is that it easily extends to a production environment in which it is necessary to make predictions with data that similarly exhibits missingness. Unfortunately, the approach can also introduce bias. In our estimates that bias is small, but it means care should be exercised concerning the interpretation of model coefficients and standard errors for those predictors in particular. What remains clear is that complete data is preferable to missing data for these kinds of models, and we hope that data completeness in these key predictors increases.
Due to limitations in recording processes and available data fields, clinically relevant semantics of medication and diagnosis variables were not considered. For example, mode of delivery and dosage were not incorporated into medication predictors, diagnosis variables did not distinguish between treated and untreated conditions/diseases, and medications were not ascribed to particular conditions/diseases. We leave the study of such predictors for future work. Also, our choice to include potentially preventable hospitalisations in the response variable was guided by studies 11,12 recommending the targeting of potentially preventable hospitalisations through integrated  www.nature.com/scientificreports www.nature.com/scientificreports/ care and other reform measures. We recognise that not all such hospitalisations are actually preventable in the short-term through management in primary care 10 . In practice, the addition of clinical assessment of patients with high probability of hospitalisation would allow further tailoring of primary care appropriate to patient needs.
Our prediction model provides a validated tool for risk stratification in GP patient populations and is deployed within a software suite used in the Health Care Homes trial. It is implemented in a number of general practice clinics (up to 200) within a web application which sources data from five different general practice information systems and uses the model to identify patients at high risk of hospitalisation over the following 12 months. Combined with other elements of the Health Care Homes trial, which include a clinical assessment to assign identified patients to tiers and a patient-centric integrated care program, it is hoped that this model will help deliver significant improvements to patient outcomes while reducing the burden of chronic conditions on the Australian health system.

Data Availability
The datasets analysed in the current study are not publicly available. Due to reasonable privacy and security concerns and requirements imposed by the ethics approval process, they are not redistributable to researchers other than those engaged in the ethics committee approved research protocol.